ARMY RESEARCH LABORATORY

An Efficient Method for Parsing

Large Finite Element Data Files

Dale Shires

ARL-TR-974 (^) March 1996

APPROVED FOR PUBLIC RELEASE; DISTRIBUTION IS UNLIMITED.


DISCLAIMER NOTICE

THIS DOCUMENT IS BEST QUALITY AVAILABLE. THE COPY FURNISHED TO DTIC CONTAINED A SIGNIFICANT NUMBER OF PAGES WHICH DO NOT REPRODUCE LEGIBLY.

REPORT DOCUMENTATION PAGE

Form Approved OMB No. 0704-

Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302, and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188), Washington, DC 20503.

  1. AGENCY USE ONLY (Leave Blank)
  2. REPORT DATE: March 1996
  3. REPORT TYPE AND DATES COVERED: Final, June 1995-September 1995
  4. TITLE AND SUBTITLE: An Efficient Method for Parsing Large Finite Element Data Files
  5. FUNDING NUMBERS: PR: 1L162618AH
  6. AUTHOR(S): Dale Shires
  7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES): U.S. Army Research Laboratory, ATTN: AMSRL-SC-SM, Aberdeen Proving Ground, MD 21005-
  8. PERFORMING ORGANIZATION REPORT NUMBER: ARL-TR-
  9. SPONSORING / MONITORING AGENCY NAME(S) AND ADDRESS(ES)
  10. SPONSORING / MONITORING AGENCY REPORT NUMBER
  11. SUPPLEMENTARY NOTES
  12a. DISTRIBUTION / AVAILABILITY STATEMENT: Approved for public release; distribution is unlimited.
  12b. DISTRIBUTION CODE
  13. ABSTRACT (Maximum 200 words)

The FORTRAN language has been and continues to be one of the most heavily used languages for implementing complex numerical algorithms. It is extremely efficient, and the straightforward constructs of the language allow today's FORTRAN compilers to take full advantage of the most novel parallel and vector computers. Many of these algorithms require large data sets to be read. This is the one task in which FORTRAN is somewhat lacking. While the main computation task should be left in FORTRAN, parsing tasks should be performed by applications with greater character stream access. Accessing and processing this character stream that is formed while the file is being read is known as parsing. This paper describes the tools available on most UNIX systems which can produce very fast parsers, the underlying methods employed by these tools, and ways these tools can be integrated with FORTRAN numerical solvers.

14. SUBJECT TERMS

parsing; finite element method

  15. NUMBER OF PAGES: 27
  16. PRICE CODE
  17. SECURITY CLASSIFICATION OF REPORT: Unclassified
  18. SECURITY CLASSIFICATION OF THIS PAGE: Unclassified
  19. SECURITY CLASSIFICATION OF ABSTRACT: Unclassified
  20. LIMITATION OF ABSTRACT: UL

NSN 754001-280-
Standard Form 298 (Rev. 2-89), Prescribed by ANSI Std. 239-, 298-102


List of Figures

  1 Finite automaton recognizing a string with zero or more x characters ending with the character sequence yz
  2 A possible grammar specification
  3 Restructured grammar using left-recursive rules

List of Tables

  1 Some regular expression operators of Lex
  2 Regular expressions for items in finite element mesh files
  3 Parsing table
  4 Parser actions
  5 Results of parsing trials

1 Introduction

FORTRAN remains the language of choice for many complex numerical algorithms. The motivations behind the development of the language help to explain its longevity. Researchers early in the computer revolution were confined to writing numerical codes in assembly language. This practice required detailed knowledge of the algorithm as well as of assembler and computer architecture specifics such as the number of registers and the memory structure. The development of the FORTRAN language was a milestone for both programming language and compiler designers. Advances in compiler design gave compiler writers the first opportunity to take a program written in a high-level language and generate assembly code of a caliber often exceeding hand-coded assembly. Codes began to be written in FORTRAN, at which point the computer specifics could be left to the compiler writers.

FORTRAN remains popular today because it is highly efficient. The execution time of many of these numerical codes is dominated by one or two small loops which perform the vast majority of the overall work of the algorithm. It is not uncommon to find one or two loops in these codes which consume upwards of seventy percent of the overall execution time. FORTRAN is very efficient at processing these loops. The simplicity of the language's loop structure is one of the main factors allowing for highly optimized compiler-generated code. These loops can be executed with great speed, with little overhead incurred by language constructs.

While it is very efficient at number crunching, FORTRAN is somewhat lacking when it comes to file input and output. Often associated with these numerical codes are very large input files or data decks. The problem of interest for this research team provides an excellent example. This team is particularly involved with manufacturing simulation, especially for composite materials. To this end, two algorithms have been developed. One is a new variant of the control volume finite element algorithm to simulate the isothermal flow of resin in the resin transfer molding (RTM) composite manufacturing process [1]. The other is an implicit time-dependent pure finite element methodology for RTM flow simulation [2]. The majority of the work in both algorithms is performed in a few small FORTRAN loops. These codes perform very well on the new pipelined architecture found in the Silicon Graphics Power Challenge computer.

However, parsing the input files is annoyingly slow and at times convoluted. This speed can be increased by taking input and output tasks away from languages like FORTRAN, which are limited in this area, and moving them to more robust byte-stream languages and libraries like those found and written in C. Furthermore, standardizing on one simple yet robust input format will also allow for faster reading. Combining regular expressions and a context-free grammar describing the structure of the input file makes it possible to create a deterministic finite automaton for pattern recognition and a parser to interpret the structure of the file. Parsing of the input file is then bounded by O(n), where n is the size of the input deck.

The techniques mentioned previously were implemented to reduce the time required to parse finite element input files. This paper describes the implementation steps and the overall results of using this parsing technique.

2 Elements of fast parsing

There are several key issues which must be addressed in the course of defining a parsing strategy. What are the basic items in the data file? What is the basic structure of the data file? These issues are not unlike those historically encountered in the development of parsing strategies for compilers. They involve:

  • Defining the basic units of the data file. In this case, these items include instances of real numbers, integers, and character strings.
  • Formalizing a description of the format of the data. This is done by defining a grammar for the input data.
  • Establishing what to do with the data as they are being read. This requires establishing data structures and actions.

Often the best way to overcome a multifaceted problem such as this is to use the divide and conquer approach. This approach calls for us to solve each of these parts of the main problem separately. The methods are described in the following sections.

2.1 Lexical analysis

Lexical analysis is the process of identifying the basic units of the data file. This process is accomplished by scanning the input stream, recognizing patterns in the data, and converting these patterns into tokens. These tokens are basically some classification for the patterns. For example, the sequence of characters "program" forms a string token and the sequence of digits 531 forms a number token. These classifications are arbitrary and must be defined by the user.

The process of building a pattern recognizer requires the construction of a transition diagram referred to as a finite automaton. These finite automata are state-transition diagrams. They tell the controlling algorithm how to act based on the current state it is in and on the next character in the input stream. The finite automaton in figure 1 can accept a string with zero or more x characters ending with the sequence yz.

A finite automaton can be either deterministic or nondeterministic. Nondeterministic automata allow more than one transition out of a state on the same input symbol whereas deterministic automata do not. There is a space-time tradeoff between the two approaches. In general, deterministic finite automata allow for faster recognizers but require more space to define.
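As an illustration of how a deterministic automaton drives recognition, the following is a minimal C sketch of a recognizer for zero or more x characters followed by yz. The state names and the function name matches_xstar_yz are assumptions made for this example; they are not taken from figure 1 or from the report's appendices.

```c
#include <stdbool.h>
#include <stddef.h>

/* States of a deterministic finite automaton for the pattern x*yz.
   The state names are illustrative only. */
enum state { START, SEEN_Y, ACCEPT, DEAD };

/* Return the next state for the current state and input character. */
static enum state step(enum state s, char c)
{
    switch (s) {
    case START:  return c == 'x' ? START : c == 'y' ? SEEN_Y : DEAD;
    case SEEN_Y: return c == 'z' ? ACCEPT : DEAD;
    default:     return DEAD;   /* nothing may follow the final z */
    }
}

/* Scan the string once, left to right, and accept only if the
   automaton finishes in the ACCEPT state. */
bool matches_xstar_yz(const char *text)
{
    enum state s = START;
    for (size_t i = 0; text[i] != '\0'; i++)
        s = step(s, text[i]);
    return s == ACCEPT;
}
```

Each character is examined exactly once and there is exactly one transition per state and character, which is what makes a deterministic recognizer run in time proportional to the length of the input.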

numbers, are fairly easy to define with regular expressions. FORTRAN real numbers are somewhat more involved and require several regular expressions to describe all the possible formats they may take. Table 2 lists the regular expressions used to define some of the items encountered in finite element mesh files.

Table 2: Regular expressions for items in finite element mesh files.

Token     Regular expression                             Example
comment   "*".*                                          * File:mesh.data
string    [a-zA-Z]+                                      nodes
integer   [-+]?[0-9]+                                    +
real      [-+]?"."[0-9]+                                 .
          [-+]?[0-9]+"."                                 -423.
          [-+]?([0-9])+"."[0-9]+                         +35.
          [-+]?([0-9])*"."([0-9])*[eE][-+]?([0-9])+      9.34e-
          [-+]?([0-9])*[eE][-+]?([0-9])+                 34e
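The distinction between the integer and real patterns of table 2 can be sketched by hand in C. This is not the Lex-generated scanner; it is a simplified illustration that treats a number as real when it contains a decimal point or an exponent, and it requires at least one mantissa digit. The names classify_number, TOK_INTEGER, TOK_REAL, and TOK_OTHER are hypothetical.

```c
#include <ctype.h>

enum token { TOK_INTEGER, TOK_REAL, TOK_OTHER };

/* Skip a run of decimal digits and return how many were consumed. */
static int skip_digits(const char **p)
{
    int n = 0;
    while (isdigit((unsigned char)**p)) { (*p)++; n++; }
    return n;
}

/* Classify a complete lexeme as an integer, a real, or neither. */
enum token classify_number(const char *s)
{
    int digits, has_point = 0, has_exp = 0;

    if (*s == '+' || *s == '-') s++;        /* optional sign */
    digits = skip_digits(&s);               /* digits before the point */
    if (*s == '.') {                        /* optional decimal point */
        has_point = 1;
        s++;
        digits += skip_digits(&s);
    }
    if (digits == 0) return TOK_OTHER;      /* no mantissa digits at all */
    if (*s == 'e' || *s == 'E') {           /* optional exponent */
        s++;
        if (*s == '+' || *s == '-') s++;
        if (skip_digits(&s) == 0) return TOK_OTHER;
        has_exp = 1;
    }
    if (*s != '\0') return TOK_OTHER;       /* trailing characters */
    return (has_point || has_exp) ? TOK_REAL : TOK_INTEGER;
}
```

Under these assumptions, classify_number("531") yields TOK_INTEGER and classify_number("-423.") yields TOK_REAL. A Lex specification expresses the same decisions declaratively and leaves the character-level bookkeeping to the generated automaton.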

2.1.1 Lexical analysis with Lex

The Lex specification file is given in appendix A. The beginning of the file lists several libraries that need to be included for various purposes such as string manipulation and input/output operations. Also listed are definitions for various local and global variables and function prototypes.

Following this is the list of regular expressions for the finite element data file. This section follows the %} and ends with the first %%. White space is defined as any space, tab, or newline character. The definitions for letters and digits are straightforward. Integers have an optional sign followed by one or more digits. There are various definitions for real numbers to correspond with all allowed FORTRAN real formats. Strings are defined to be sequences of letters. Finally, a comment is defined to start with an * and comprise all characters until the end of the line.

Next comes a list of actions that are to be performed when the regular expressions are matched. For integers, the string of characters is converted to an integer whose value is stored for the parser to use. The token integer is returned to the parser. For real numbers, a similar action is taken with a real token being returned to the parser. White space and comments result in no actions. All strings are first converted to upper case. A function is then called which scans a list of keywords and, if the string is a reserved word or keyword, returns a token for the keyword. Finally, any unmatched characters result in an error message being displayed.

Following the second %% and continuing until the end of the file are the supporting functions. These functions perform various tasks such as converting strings from lower to upper case and checking a string to see if it is a keyword.
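The two supporting tasks described above, upper-casing a string and checking it against a keyword list, might look like the following C sketch. The function names, the keyword table, and the token codes are assumptions for illustration; appendix A is not reproduced here, and in an actual build the token codes would come from the y.tab.h file generated by Yacc.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical token codes; a real build would take them from y.tab.h. */
enum { TOK_STRING = 258, TOK_NODES, TOK_ELEMENTS };

/* Reserved words of the input format, stored in upper case. */
static const struct { const char *name; int token; } keywords[] = {
    { "NODES",    TOK_NODES    },
    { "ELEMENTS", TOK_ELEMENTS },
};

/* Convert a string to upper case in place. */
void to_upper(char *s)
{
    for (; *s != '\0'; s++)
        *s = (char)toupper((unsigned char)*s);
}

/* Return the keyword token for an upper-case string, or TOK_STRING
   if the string is not a reserved word. */
int keyword_token(const char *s)
{
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(s, keywords[i].name) == 0)
            return keywords[i].token;
    return TOK_STRING;
}
```

A linear scan is adequate for a handful of keywords. A format with many descriptors might prefer a sorted table with binary search or a hash table.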

2.2 Syntax analysis and parsing

The input deck for the executing code must adhere to some rigid format to facilitate quick scanning. This format, or syntax, is best defined through the use of a context-free grammar, or grammar for short. A grammar naturally describes the syntactical structure of a language. Grammars can be very complex because of this. Indeed, they are most often used to define elaborate hierarchical and recursive constructs in programming languages. In this case, the format for an input deck, as well as the defining grammar, can be very simple. Context-free grammars consist of four components:

  1. A set of tokens, or terminal symbols. These are the items recognized and returned by the lexical analyzer.
  2. A set of nonterminals.
  3. A set of productions. These productions consist of a nonterminal on the left side, an arrow, and then a sequence of nonterminals or tokens on the right side.
  4. A nonterminal designated as the start symbol.

Historically, grammars are specified by listing their productions with the start symbol listed first. Productions define the valid orderings of tokens in the file. Digits and boldface strings such as nodes are considered to be terminals. Italicized names are nonterminals and any nonitalicized names or symbols are tokens. If the nonterminal on the left has more than one production, the right sides may be grouped and separated with the | symbol. For example, the grammar below may derive one item of the set of domestic animals {dog, cat} or one item of the set of wild animals {racoon, wolverine, bear}.

animals  -> domestic | wild
domestic -> dog | cat
wild     -> racoon | wolverine | bear

The structure of the finite element input deck can be of a simplistic nature. For the isothermal filling algorithm, the vast majority of the file will be entries defining the grid points of the mesh and corresponding connectivity of these points. These entries are often referred to as nodes and elements, respectively. Other entries, such as material descriptors, may also be required. General purpose structural analysis programs have more functionality and usually support many data descriptors. For example, NASTRAN * supports over 100 data card descriptors [4]. Since we are more concerned with flow simulations, we focus on the two descriptors comprising the bulk of our data files. However, parser construction through grammar specification is the same for both large and small input formats.

*NASTRAN is a registered trademark of the National Aeronautics and Space Administration.

Regardless, the process of building a parser can be a laborious one requiring the compiler writer to compute many complicated sets and tables. A complete discussion of parsing and syntax analysis is beyond the scope of this paper and left to the reader.* Special computer programs, called parser generators, have been written to mitigate some of this complexity. They take grammars as input and construct the set of parsing action tables. These utilities are very helpful in instances where the defining grammar may change or be augmented, as is true in this case. The most widely available parser generator is Yacc (yet another compiler-compiler), and it is used to generate the parser for this grammar.

The parser generated with Yacc is termed an LALR parser. The "LA" stands for lookahead and the "LR" for left-to-right scanning of the input, rightmost derivation in reverse. This parser has four actions it can perform: accept the input, indicate an error, shift, or reduce. The input is accepted if it can be derived from the grammar; otherwise an error is reported. Shifts are the most common operation and are performed while the input is being parsed. A reduction is performed when the right-hand side of a grammar production is recognized.

Consider a simple grammar with one production S -> abc and an input stream abc. The parsing table for this grammar with states and actions is shown in table 3. The actions taken by the parser are shown in table 4. The $ represents the end of input.

Table 3: Parsing table.

state    action                                 goto
         a       b       c       $              S
0        s2                                     1
1                                accept
2                s3
3                        s4
4                                reduce

Table 4: Parser actions.

stack      input    parser action
0          abc$     shift 2
0a2        bc$      shift 3
0a2b3      c$       shift 4
0a2b3c4    $        reduce S -> abc
0S1        $        accept
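The shift, reduce, and accept steps of tables 3 and 4 can be traced with a small table-driven loop. The C sketch below hard-codes the table for the single production S -> abc; the type and function names are assumptions for this example and do not reflect the code that Yacc actually generates.

```c
#include <stdbool.h>

enum { N_STATES = 5, N_SYMBOLS = 4 };        /* columns: a, b, c, $ */
enum kind { ERR, SHIFT, REDUCE, ACCEPT };
struct action { enum kind kind; int target; };

/* Action table for S -> abc; entries not listed default to ERR. */
static const struct action action_table[N_STATES][N_SYMBOLS] = {
    [0] = { [0] = { SHIFT, 2 } },
    [1] = { [3] = { ACCEPT, 0 } },
    [2] = { [1] = { SHIFT, 3 } },
    [3] = { [2] = { SHIFT, 4 } },
    [4] = { [3] = { REDUCE, 0 } },
};

/* Map an input character to its table column; '\0' plays the role of $. */
static int column(char c)
{
    switch (c) {
    case 'a':  return 0;
    case 'b':  return 1;
    case 'c':  return 2;
    case '\0': return 3;
    default:   return -1;
    }
}

/* Return true if the input can be derived from S -> abc. */
bool parse_abc(const char *input)
{
    int stack[8];                /* stack of parser states */
    int top = 0;
    stack[0] = 0;                /* start in state 0 */
    for (;;) {
        int col = column(*input);
        if (col < 0) return false;
        struct action act = action_table[stack[top]][col];
        switch (act.kind) {
        case SHIFT:              /* push the new state, consume the symbol */
            stack[++top] = act.target;
            input++;
            break;
        case REDUCE:             /* pop three states for a, b, and c */
            top -= 3;
            stack[++top] = 1;    /* goto entry for S out of state 0 */
            break;
        case ACCEPT:
            return true;
        default:
            return false;
        }
    }
}
```

Only the state numbers are kept on the stack here; table 4 also shows the grammar symbols for readability, but the states alone are enough to drive the parse.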

We start in state 0 with an a as the next symbol in the input stream. According to the table, state 2 is shifted onto a run-time stack. In state 2 with b the next symbol, the action is to shift state 3. State 4 is then shifted onto the stack and all data have been shifted. In state 4 with the end of file marker we reduce by the rule S -> abc. This pops three states off the stack, leaving us in state 0 with the symbol S on the stack. We then go to state 1 with the end of file marker still the next symbol in the input stream. At this point, the parser accepts the input.

*For additional information regarding issues in parsing and syntax analysis, see [5].

LALR parsers can accept ambiguous grammars. Yacc provides mechanisms such as precedence operators to preclude ambiguity. During its final stage of processing, Yacc will actually report the number of ambiguities it encountered and could not resolve. These are either shift-reduce conflicts or reduce-reduce conflicts. A shift-reduce conflict occurs when the parser has reached a state where it could either shift the next input symbol or reduce a right-hand side. Reduce-reduce conflicts occur when the parser reaches a state where two possible reductions could be performed.

The grammar for the finite element input files need not be as involved as those for some programming languages. Accordingly, rather than trying to use disambiguating rules, the grammar should be designed so that there are no ambiguous rules. It makes sense to group similar data items in the file. The best way to do this is to partition the data file into logical blocks of similar items. The data are grouped by using the nodes and elements tokens. These tokens inform the parser of what to expect next in the file and allow the data to be grouped in a manner such as:

NODES

ELEMENTS

Given all of the above information, figure 2 lists a first try at specifying a grammar for the nodes and elements of the finite element input file.

startpoint   -> items
              | startpoint items
items        -> nodes node-list
              | elements element-list
node-list    -> integer real real real
              | integer real real real node-list
element-list -> integer integer real integer integer integer
              | integer integer real integer integer integer element-list

Figure 2: A possible grammar specification.

2.2.1 Structuring the grammar for Yacc

The grammar given in figure 2 is easy to understand. The start symbol is called startpoint. This nonterminal can derive one item, or many items by recursively calling itself. This is a left-recursive rule. Notice also that right-recursive rules are used in figure 2 to specify the list of nodes and list of elements. LALR parsers can accept grammars which have both left- and right-recursive rules. These rule structures are often used for specifying lists. The list of items includes nodes and elements. The lists of nodes and elements specify the sequence of tokens that should be encountered. These lists continue as long as a valid sequence of real and integer tokens is read.

2.2.2 Parsing with Yacc

The Yacc specification file is given in appendix B. The beginning of the file is similar to the Lex specification where included library routines are listed. Following this is a list which defines the tokens. Some tokens have attributes associated with them. For instance, the token integer should have some integer value associated with it. This association of tokens with actual data is accomplished using the C structure feature. The lexical analyzer will set the integer attribute in a code fragment upon encountering an integer, the real attribute upon encountering a real, etc. This structure is created with the %union statement. The variable yylval assumes this structure. Accordingly, upon encountering an integer in the input data, the lexical analyzer can set yylval.integer equal to the actual encountered integer. The start point is defined to be startpoint.

Enclosed by the %% symbols is the context-free grammar in Yacc syntax. The grammar is identical to that given in figure 3 with a minor difference. Actions are placed inside {} symbols. As an example, consider the node-list productions. The actions involve actually storing the data encountered during the parse into some structure for later use. In this case, the numbers being read are stored into arrays. The $ allows access to the values that were assigned in the lexical analysis section. In the statement

node-list : INTEGER REAL REAL REAL

the actual integer value associated with the integer token may be accessed by using the $1 operator since it is the first token to the right of the colon. The real values associated with the real tokens are accessible by using $2, $3, and $4. The correction by -1 for the arrays is attributable to the difference in the way C and FORTRAN handle array storage. Following the second %% to the end of the file are various supporting functions.
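The effect of such an action, including the correction by -1, can be sketched as a plain C function. The arguments id, x, y, and z stand in for the values that the Yacc action would obtain from $1 through $4. The array names, the capacity MAX_NODES, and the function name store_node are hypothetical; the actual action code is in appendix B, which is not reproduced here.

```c
#include <stdbool.h>

#define MAX_NODES 50000   /* hypothetical capacity for this sketch */

/* Coordinate arrays indexed from 0 in C; node numbers in the
   input file start at 1, as FORTRAN arrays do by default. */
static double node_x[MAX_NODES];
static double node_y[MAX_NODES];
static double node_z[MAX_NODES];

/* Store one node. id is the 1-based node number from the file;
   x, y, and z are its coordinates. Returns false if id is out of range. */
bool store_node(int id, double x, double y, double z)
{
    if (id < 1 || id > MAX_NODES)
        return false;
    node_x[id - 1] = x;   /* -1 maps node 1 to C index 0 */
    node_y[id - 1] = y;
    node_z[id - 1] = z;
    return true;
}
```

With this layout, node 1 lands in element 0 of each C array, which is the same storage location a FORTRAN routine would address as element 1 of the corresponding array.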

3 Combining the parts

The Lex and Yacc specifications have been described in some detail. The only remaining point of discussion is how to properly tie these items together. Since most of the computing is done in FORTRAN, the driver for the parser is also given in FORTRAN. The code for this routine is listed in appendix C. C and FORTRAN code can easily be combined. The main concern is making sure that the variable types match between the two languages. Appendix D lists the header file for the Lex and Yacc routines which defines the C structure to match variables in the FORTRAN structure.

To compile the Yacc specification file, issue the command yacc -d filename. This creates a file named y.tab.c. The -d option instructs Yacc to generate a file named y.tab.h containing token definitions which must be included into the Lex specification file. To compile the Lex specification file, issue the command lex filename. This produces a C file named lex.yy.c. The files lex.yy.c and y.tab.c must be compiled separately from the FORTRAN files by issuing the command cc -c lex.yy.c y.tab.c. This will generate object files which can be linked and loaded by the FORTRAN compiler. The final command to create the executable is then f77 y.tab.o lex.yy.o driver.f.
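The type-matching concern can be illustrated with a small C sketch of routines meant to be called from FORTRAN. It assumes conventions common to many UNIX f77 compilers but not guaranteed by any standard: external names are lower-cased with a trailing underscore, every argument is passed by reference, FORTRAN INTEGER corresponds to C int, and DOUBLE PRECISION corresponds to C double. The routine names and the stand-in data are hypothetical; the report's own interface is in appendices C and D, which are not reproduced here.

```c
/* Stand-ins for values a parser would have filled in. */
static int    parsed_node_count = 2;
static double parsed_x[] = { 0.0, 1.5 };

/* FORTRAN side (assumed):  INTEGER N
                            CALL NODCNT(N)          */
void nodcnt_(int *n)
{
    *n = parsed_node_count;
}

/* FORTRAN side (assumed):  DOUBLE PRECISION X(N)
                            CALL NODX(N, X)
   Copies at most *n x-coordinates into the caller's array. */
void nodx_(const int *n, double *x)
{
    for (int i = 0; i < *n && i < parsed_node_count; i++)
        x[i] = parsed_x[i];
}
```

Passing pointers for every argument mirrors FORTRAN's call-by-reference semantics. If the FORTRAN side declared X as REAL instead of DOUBLE PRECISION, the C parameter would need to be float * for the two views of memory to agree.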

4 Results

This section lists some parsing time results for the LALR parser generated with Lex and Yacc and also some comparisons to a parsing system written in FORTRAN. The FORTRAN parsing technique required one line to be read at a time from the input file. This line was then searched by a routine which identified tokens in the line. Two files were used as test cases. The first was a mesh of a bridge truss having 5,325 nodes, 10,898 triangular elements, and 865,511 total characters. The second file was the mesh of a component of the RAH-66 Comanche helicopter. This mesh comprised 23,348 nodes, 45,990 triangular elements, and 3,697,579 total characters. Parse times were averaged over three trials. The trials were performed on a Silicon Graphics Power Challenge with a 75-MHz R8000 processor. Table 5 lists the results.

Table 5: Results of parsing trials.

File (size)                        Parse time (in seconds)
                                   FORTRAN    LALR (Lex & Yacc)
Bridge truss (845 Kbytes)
RAH-66 Comanche (3.69 Mbytes)

Table 5 shows some rather dramatic results. The parser generated by Lex and Yacc was able to parse the input files approximately 15 times faster than the corresponding FORTRAN parser. The multiple scanning used by the FORTRAN parsing method severely degrades that parser's performance.

5 Conclusion

Lex and Yacc provide effective tools for implementing LALR parsers quickly and easily. These tools promote parser expandability and impart a logical nature to the entire parser construction process. Most importantly, the LALR parsers generated by Lex and Yacc are extremely efficient. This efficiency easily surpasses that of many other more contrived methods. This reduced parse time is notable and worth pursuing in virtually all data file parsing tasks.