































































































































































Revision History
Revision | Date | Description
Version 1.0 | March 3, 2008 | Initial draft
Version 1.1 | May 12, 2008 | Incorporated comments from PMO and development team. Added logical data model and system-level component model.
Version 2.0 | September 5, 2008 | Restructured SDD to multiple volumes of individual design documents. The current document is updated to focus on high-level system architecture and design. Added details on OAIS model implementation.
1. Introduction
This document describes the architecture and system design of the Federal Digital System (FDsys) for the U.S. Government Printing Office. It is a living document that evolves throughout the design and implementation for each release. Each release will have an edition of the document, and the current edition of the document is for the first public release R1C2.
The goal of this document is to cover the high-level system architecture and design. The document is divided into three major parts: system architecture, software design, and external interfaces. The system architecture includes views from various perspectives. The software design details the main software components that operate under and support the system architecture. It also includes the content lifecycle and workflows that support the business operations of the system. External interfaces are documented in a separate section because of their special roles in the system.
The system design document (SDD) for FDsys consists of multiple volumes of individual design documents. In addition to the current document, which focuses on high-level architecture and design, separate detailed design documents are created for each of the major components of the system and data management documents for each type of the publications that are managed by the system. Table 1.2-1 lists the SDD volumes along with their coverage for R1C2.
SDD Volume | Coverage | Lead Author
Volume I: System Architecture | High-level system architecture and design, OAIS model mapping, logical data model, etc. | FDsys Architect
Volume II: Content Repository | Detailed content repository design, including CMS object model for the content, CMS configuration, application security, and content management and archival functionalities, etc. | Documentum Architect
Volume III: FDsys Publish | Architecture and detailed design for publishing content for public access | FAST Architect
Volume IV: Search API | Architecture and detailed design for FAST search engine APIs | FAST Architect
Volume V: Search Configuration | Architecture and detailed search engine configuration, benchmark methodology, and server failure recovery strategies for the FAST search engine | FAST Architect
Volume VI: Custom Application | Custom applications for FDsys, including parser framework, content source module, web applications as the search front-end to the FAST search engine, etc. | Chief Software Engineer
Volume VII: Data Management Definition Document - Federal Register | Detailed data management document for Federal Register | FAST Architect
Volume VIII: Data Management Definition Document - Congressional Bills | Detailed data management document for Congressional Bills | Data Analyst
More DMDs … as they complete
The design documentation is in general for anyone who wants to understand the system architecture and design of FDsys. The following groups are in particular the intended audience of the document.
The current document is organized as follows.
Section | Purpose
Section 2: System Overview | To describe the purpose of the system, and provide a conceptual design, along with some high-level design considerations.
Section 3: Scope of Release 1C.2 | To describe the scope of the R1C2, and the incremental development approach for FDsys implementation.
Section 4: System Architecture | To present the system architecture of FDsys, by viewing the system from various perspectives.
Section 5: Software Design | To describe the content data model of FDsys, and functional designs of the system to support the data model and business operations configured in the workflows.
Section 6: Business Process Implementation | To describe how content flows in the system, and the workflows that support daily FDsys operations.
Section 7: External Interfaces | To document the external interfaces in R1C2.
Section 8: Deployment | To summarize the main deployment architecture at the hardware and application level.
Concept of Operation for the Federal Digital System.
Federal Digital System Requirements Document 3.2.
Reference Model for an Open Archival Information System (OAIS).
such as content ordering. Interactions with applications in the other two information systems pillars in later releases are therefore also supported in this functional area.
Content Preservation
While content management is at the heart of its functionality, FDsys goes beyond what standard enterprise content management systems provide. One of the critical missions of FDsys is to preserve the content in its original form, and to perform preservation processing and technology refreshment on the content, so that the content remains permanently accessible.
Content Access
In addition to the new strategic mission of long-term content preservation, another critical mission for FDsys is to become the next generation of GPO Access for content access and dissemination. The current GPO Access was built more than a decade ago with a primary focus on making government publications available online. Its architecture and enabling technologies have shown serious weaknesses: they cannot support its business functions efficiently without frequent and intensive manual intervention.
The access component of FDsys will subsume the functionalities of the current GPO Access with a new architecture and design supported by modern technologies. Because the highly visible GPO Access currently runs on a failing architecture and depends on complex daily operating processes, replacing it is one of the high-priority features for the first public release of FDsys, the R1C2 release.
To accomplish its missions, FDsys conceptually has three major subsystems, as depicted in Figure 2.2-1. Two separate content repositories are created, one for the content management subsystem and one for the content preservation subsystem. These two subsystems are accessible only within the GPO intranet. The content management repository supports daily operations of FDsys, such as accepting content submission, updating existing content and metadata in the system, and publishing content and metadata for public access. The archival repository supports content preservation. Preservation processes in releases after R1C2 will all be performed on the archival repository. The two repositories communicate with each other when necessary, but each has its own independent storage for the content.
The access subsystem is in the DMZ for public content access and dissemination. The publicly accessible FDsys packages are published from the content management repository to the access subsystem, which processes the content and associated metadata and makes them available online for general public access.
This high level conceptual view of the three subsystems will be reflected in the system architecture and application designs throughout the system design documentation.
Figure 2.2-1 FDsys Conceptual Design
FDsys supports two categories of users: authorized users and public users. The content management and preservation subsystems are accessible only to authorized users. Authorized users are further categorized into functional specialists and system administrators or managers. The following lists the specialists and managers that are supported in R1C2.
The OAIS information model describes the concept of Information Package and defines what should be included in the Information Package. The OAIS model proposes three Information Packages: Submission Information Package (SIP), Archival Information Package (AIP), and Dissemination Information Package (DIP). Each information package includes digital objects to be preserved, metadata required to describe the digital objects, and the packaging information that associates the digital objects with their describing metadata.
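The three parts of an information package can be illustrated with a small data structure. The following is only a sketch: the class names, field names, and identifiers are assumptions made for illustration and are not taken from the OAIS specification or from the FDsys design.

```python
from dataclasses import dataclass, field
from enum import Enum


class PackageType(Enum):
    """The three information packages proposed by the OAIS model."""
    SIP = "Submission Information Package"
    AIP = "Archival Information Package"
    DIP = "Dissemination Information Package"


@dataclass
class DigitalObject:
    """A digital object to be preserved (field names are hypothetical)."""
    object_id: str
    filename: str
    mime_type: str


@dataclass
class InformationPackage:
    package_type: PackageType
    digital_objects: list = field(default_factory=list)
    # Metadata describing each digital object, keyed by object_id.
    metadata: dict = field(default_factory=dict)

    def add_object(self, obj, descriptive_metadata):
        """Add a digital object together with the metadata that describes it."""
        self.digital_objects.append(obj)
        self.metadata[obj.object_id] = dict(descriptive_metadata)

    def packaging_info(self):
        """Packaging information: associate each object with its metadata."""
        return [(obj.filename, self.metadata[obj.object_id])
                for obj in self.digital_objects]
```

In this sketch, a package is built by calling `add_object` once per file, and `packaging_info` returns the association between each file and its describing metadata.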
Finally, the responsibilities of an OAIS archive required by the OAIS model are:
As clearly indicated in the ConOps and specified by a set of specific requirements, FDsys will follow the OAIS reference model to manage the content lifecycle. While some of the OAIS entities, such as Data Management for the archive, can be mapped to implementations of relevant functionalities from commercially available enterprise content management systems (CMS), commercial CMS products are in general not designed to conform to the OAIS model, and to the OAIS information model for long-term preservation in particular. This presents a challenge for FDsys in implementing the reference model by using the out-of-box features of a commercial CMS product.
Every commercial CMS product has its own proprietary data model for the content it manages. Though the implementation approaches vary from one product to another, the separation between the content and metadata is common to all data models of the COTS CMS products. While the content may be stored in various storage devices such as a file system, the metadata are normally stored in a persistent store such as a relational database. How the association between the metadata and content is modeled and managed varies widely between the CMS products, and has become one of the key differentiators between the competing products.
The simplest and easiest implementation of a content management system would be to use the out-of-box content data model of a COTS CMS, and leverage the application tools usually bundled with the CMS offering to manage the content lifecycle with little customization. An implementation of this type, however, creates a total dependence on the underlying CMS, and has little flexibility to adapt to technology evolution over time. This approach, therefore, cannot fully meet the FDsys requirement of independence from the underlying supporting technology.
Through the information package concept, the OAIS model proposes a high-level abstraction that creates the opportunity for an implementation-independent packaging scheme. This is especially beneficial for long-term preservation, which normally has to outlive the lifecycle of the underlying technology that facilitates the preservation process. Through its carefully designed content data model and self-describing archival package, FDsys provides an implementation of the OAIS AIP by leveraging the content management capabilities of the COTS CMS product with an XML-based abstraction layer to minimize the dependence on the underlying CMS product. Details of the FDsys implementation strategy for the OAIS model can be found in 5.x.
Metadata management is at the heart of all content management systems, and is one of the key functions of FDsys. Unfortunately the commercial CMS products, as mentioned earlier, all have their own proprietary metadata models, which have become one of the critical differentiators between the competing products. These non-standard metadata models present a problem for FDsys in accomplishing its mission: to preserve and disseminate the content and metadata over an indefinitely long period of time. This mission requires that the FDsys implementation remain flexible, not be tied to any proprietary CMS implementation, and be able to adapt to technology changes over time.
To achieve this goal for long-term preservation, the FDsys requirements specify that all metadata for FDsys content must be in XML form, promoting an implementation that manages the metadata for FDsys content independently of the metadata model of the underlying CMS product. The requirements also reflect the fact that most metadata standards from library and other information management communities are in XML form. Managing metadata in XML enables FDsys to easily interface with other systems when needed.
Managing the metadata solely in XML, independently of the underlying CMS metadata model, requires extensive customization. FDsys meets this set of requirements, along with the OAIS packaging requirements, by implementing an abstracted packaging service on top of the supporting commercial CMS product. The XML files containing the FDsys metadata are treated by the underlying CMS as regular content files, and the packaging service extracts the descriptive metadata from the CMS metadata model and writes it to the XML metadata files in the archival package. With all descriptive and technical information stored in and available from the XML files for a package, the package becomes self-describing and independent of the CMS that is used to create the package.
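The idea of copying descriptive metadata out of the CMS into a self-describing XML file can be sketched as follows. This is a minimal illustration, not the FDsys packaging service: the element names (`package`, `descriptiveMetadata`, `field`) and the attribute dictionary standing in for the CMS metadata model are assumptions, and a real package would follow established metadata standards.

```python
import xml.etree.ElementTree as ET


def build_metadata_xml(package_id, cms_attributes):
    """Serialize CMS-held attributes into a standalone XML metadata document.

    `cms_attributes` is a plain dict standing in for values read from the
    CMS metadata model (an assumption made for this sketch).
    """
    root = ET.Element("package", {"id": package_id})
    descriptive = ET.SubElement(root, "descriptiveMetadata")
    for name in sorted(cms_attributes):
        element = ET.SubElement(descriptive, "field", {"name": name})
        element.text = str(cms_attributes[name])
    return ET.tostring(root, encoding="unicode")


def read_metadata_xml(xml_text):
    """Recover the metadata from the XML alone, without access to the CMS."""
    root = ET.fromstring(xml_text)
    return {f.get("name"): f.text for f in root.iter("field")}
```

Because `read_metadata_xml` needs only the XML text, the sketch shows in miniature why a package whose metadata lives in XML files can be interpreted without the CMS that created it.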
It should be pointed out that these XML files are for metadata only. The majority of the FDsys content is not in XML form, but in file formats such as plain text, PDF, and TIFF. Therefore the native XML applications offered by a few commercial CMS products (e.g. Documentum) for XML content management are not applicable to FDsys content in this category.
FDsys may accept submissions where the content itself is in XML or SGML format. For R1C2, content in these XML-like formats will also be treated like any other content file (e.g. PDF).
While the GPO Enterprise Architecture is still being developed, FDsys will conform to the established models within the evolving enterprise architecture. The operating platforms for hardware and software, the development languages, and the persistent store for FDsys all conform to the specifications in the TRM (Technical Reference Model) of the enterprise architecture. Specific tools in the TRM for enterprise use, such as LDAP for user authentication and ESB for enterprise application integration, will also be utilized in FDsys.
3. Scope of Release 1C.2
According to the FDsys requirements, the scope of FDsys covers a large number of business functional areas. It is impractical, if not impossible, to develop and deploy the whole system of FDsys at once. An incremental approach for the FDsys development must be adopted to reduce the high risk associated with the all-at-once approach in terms of cost, schedule and overall success of the system implementation.
The Program Management Office (PMO) for FDsys has performed a detailed analysis on the requirements, and divided the requirements into feature groups with priority assignment. The priority assignment has taken into account the inter-dependency of the feature groups. For example, the infrastructure must be laid out first as a platform to enable FDsys operations. The packaging scheme following the OAIS model must be ready before FDsys can accept and process any content for preservation and public access. As a first public release, R1C2 will focus on the following feature groups.
The features in this group provide the system infrastructure on which FDsys is deployed and operates. The features include hardware infrastructure, security architecture, system availability, backup and monitoring, and integration for FDsys to interface with external systems, which for R1C2 is the ILS (Integrated Library System).
This feature group is for FDsys to implement the concept of the OAIS Information Packages and to manage the metadata in XML with a set of established metadata standards. It must be implemented before any content and metadata can be accepted and processed by FDsys. The implementation of this feature group must be complete enough to cover the various types of content that are ingested into FDsys at the first release or later releases, because modifications to the content packaging scheme in later releases are highly undesirable and will be prohibitively costly. The one exception to this completeness is the set of supported metadata standards: the system must be designed so that FDsys is able to support new metadata standards introduced in the future. This feature group is considered the foundation of FDsys for content preservation and public access. It will be one of the primary focuses of the R1C2 development.
Once the packaging scheme and metadata management are ready, FDsys is able to create and preserve the Archival Information Packages (AIP). While supporting functionalities for preservation processes, such as content refreshment, are planned for later releases, R1C2 release will have the AIP created and preserved in an archival repository that is separate from the working repository for daily content and metadata management as shown in the FDsys conceptual design. The preservation processes in later releases will operate on the archival repository created at R1C2.
The content source for R1C2 will be exclusively the congressional submission by the GPO Plant Operations. The existing business processes established in the Plant Operations will continue to process the content files up to the point where the subsequent processing is to produce WAIS-specific file formats. The content currently available from the WAIS database will be migrated to FDsys, and a submission tool for the migration will be developed in R1C2.
In addition to the migration submission, two types of congressional submissions will be supported in R1C2 for the day-forward content. One is the interactive submission, where authorized users upload the content files and metadata to FDsys through a browser-based FDsys user interface. The second type is a folder-based submission, where content files are placed in a predefined hot folder for submission to FDsys. Once the content files are submitted through the hot folder, the authorized users will use the same browser-based FDsys user interface as in the interactive submission case to manage the content files before the final submission for ingest.
The content submission for R1C2 is further discussed as part of the GPO Access replacement.
This group of features is to subsume the functionalities available from the current GPO Access, with additional significant improvements. This feature set clearly depends on the successful implementation of the packaging and metadata management features. It is another primary focus of the R1C2 release.
The current GPO Access operates on a WAIS infrastructure, a client-server text search system that offers few metadata management capabilities. A few pieces of metadata in the WAIS content data are created by GPO's homegrown utility programs before the content is passed to the WAIS indexing process. Some metadata information is embedded in the content directory structure or file names. The lack of a consistent and complete metadata set in GPO Access implies that all the metadata required by the FDsys information packages must be systematically extracted before the WAIS content can be migrated to and processed by FDsys.
Metadata extraction from the content is a challenging task. A few commercial tools are available, but mostly for forms and similar content with predictable format patterns. FDsys adopts the approach of parsing the content and extracting the required metadata through custom parsers. Because of the complexity of the publication structures and the large number of variations in structure from one publication collection to another, parsers have to be developed for each collection, while maximizing the use of common metadata elements that can be applied across collections whenever possible. Parser development is a tedious and time-consuming process; multiple iterations are usually required to complete a parser for a particular collection with an acceptable level of accuracy.
As such, FDsys development takes a phased approach for GPO Access replacement to meet the target objectives for each release. For release R1C2, the objectives are:
Figure 3.4.2-1 Current GPO Access Processing and Planned Transition
GPO Access replacement involves two major parts: migrating existing GPO Access content to FDsys, and enabling a process to ingest new or day-forward content into FDsys, as indicated by the dashed lines in Figure 3.4.2-1. For migration, the content source includes the content files from the WAIS infrastructure and some additional files from other file systems (Jukebox, Alpha3) currently maintained by the Plant Operations. A migration tool will be developed to traverse the migration files in the staging area for all collections, and to feed the content, together with the few metadata elements that are available only from the original directory structure, to FDsys for ingest. Details of the tool design and the ingest process will be addressed in section x.
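The traversal described above can be sketched in a few lines. This is an illustration only, not the design of the migration tool: the layout of the staging area is not specified here, so the sketch assumes that the first directory under the staging root names the collection, and the dictionary keys are hypothetical.

```python
import os


def collect_migration_items(staging_root):
    """Walk the staging area and pair each file with path-derived metadata.

    Assumption for this sketch: the first directory level under
    `staging_root` names the collection; deeper levels are kept as-is.
    """
    items = []
    for dirpath, dirnames, filenames in os.walk(staging_root):
        dirnames.sort()  # make the traversal order deterministic
        relative_dir = os.path.relpath(dirpath, staging_root)
        if relative_dir == ".":
            continue  # files directly under the root belong to no collection
        parts = relative_dir.split(os.sep)
        for filename in sorted(filenames):
            items.append({
                "path": os.path.join(dirpath, filename),
                "collection": parts[0],
                "subfolders": parts[1:],
            })
    return items
```

Each returned item carries the file path plus the metadata that exists only in the directory structure, which is the information the text says must accompany the content at ingest.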
For the day-forward content submission, R1C2 will leverage the current content processing up to two selected steps. The first is the step after the creation of the screen-optimized PDF file. In the current processing, the PDF files are created manually using Adobe Distiller, along with a table mapping the PDF files to the corresponding text versions of the files. The second is the step after the homegrown tool, CDTP, is invoked. The output of CDTP (i.e. the .done file) is a cleanly formatted text file with the necessary tags inserted, if applicable, to indicate the granule separation of the publication. Processing steps after these two either prepare the text files specifically for WAIS indexing or make the PDF, along with other files, available to content subscribers. For R1C2, processing steps that further prepare the files for WAIS can be retired after the transition period.
As shown by the dashed arrows in Figure 3.4.2-1, interactive and folder-based tools will be developed to submit the PDF and the CDTP output file, along with other appropriate renditions in other file formats (such as locator, SGML when applicable), to FDsys. Similar to the migration case, the parsers for the day-forward content will be applied to the text version of the content, and the extracted metadata will be submitted for ingest along with the original content files. Details of the day-forward content submission and ingest will be addressed in section x.
Since the current GPO Access is a live system, industry best practice calls for a transition period when switching from a live system to its replacement, to mitigate the high risks associated with a sudden switch. During the transition period, the current GPO Access and FDsys will operate in parallel, providing similar functionalities for public access, while FDsys will have improved features in selected areas. It should be noted that the transition period concerns the smooth launch of the replacing system, and in principle should have little impact on the architecture and design of the new system.
In summary, the scope of the release R1C2 is to complete implementations of the above listed feature sets to build the foundation of FDsys packaging scheme for preservation and public access and to replace the current GPO Access with the access subsystem. The content will be managed in the form of FDsys information packages, and also stored as AIPs in a separate archival repository of FDsys. From the implementation point of view, R1C2 will serve as a foundation for later releases to add more features or to enhance the functionalities of the system, completing the FDsys development in the incremental releases.
The FDsys applications are developed to extend or customize the capabilities provided by Documentum. Each of the applications is briefly described below.
Content Parsing
The purpose of the content parsing is twofold. First, it is responsible for extracting metadata from content for preservation and access needs. A set of operation interfaces will be defined to provide input to and retrieve output from the parsers. All parsers, whether common to all publications or specialized to a particular publication, must implement the operation interfaces. The parsing framework also provides a mechanism to plug in new parsers or to remove (and hence replace) parsers from the service. This is to enable a phased approach for parser development.
The second purpose of the content parsing is to handle publication granules. The content source for R1C2 is limited to the existing WAIS content and day-forward content from the Plant Operations. The publication granules are currently created by the established manual process and tagged in the input content files. So the remaining job for the content parsing is to interpret the tagged granules, which are normally publication-specific, and to save the granules to separate files for access.
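A plug-in arrangement of this kind might look like the sketch below. It is illustrative only: the actual FDsys operation interfaces are not described here, so the `Parser` interface, the `ParserRegistry` class, and the toy parser are all hypothetical names chosen for the example.

```python
from abc import ABC, abstractmethod


class Parser(ABC):
    """Operation interface every parser implements (hypothetical shape)."""

    @abstractmethod
    def parse(self, text):
        """Return a dict of metadata extracted from the content text."""


class ParserRegistry:
    """Holds one parser per collection; parsers can be added or removed."""

    def __init__(self):
        self._parsers = {}

    def register(self, collection, parser):
        # Registering again for the same collection replaces the old parser.
        self._parsers[collection] = parser

    def unregister(self, collection):
        self._parsers.pop(collection, None)

    def extract(self, collection, text):
        parser = self._parsers.get(collection)
        if parser is None:
            raise LookupError(f"no parser registered for {collection!r}")
        return parser.parse(text)


class FirstLineTitleParser(Parser):
    """Toy parser: treats the first non-empty line as the title."""

    def parse(self, text):
        for line in text.splitlines():
            if line.strip():
                return {"title": line.strip()}
        return {}
```

Because callers depend only on the interface and the registry, a collection-specific parser can be developed later and swapped in without changing the code that requests metadata, which is what makes a phased rollout practical.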
It should be noted that the content parsing for publication granules in R1C2 is designed to interpret and transform the granules that are manually created during the GPO Access content processing. Until granules are programmatically identified and generated, this design will continue to serve its purpose in later releases. At present, no tools have been found available to automatically create the granules that meet GPO’s granule requirements.
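Interpreting tagged granules amounts to splitting the input text at the tags. The sketch below assumes a made-up tag format, a line beginning with `@@GRANULE` followed by an identifier; the real tags are publication-specific and are not given in this document.

```python
def split_granules(tagged_text, marker="@@GRANULE"):
    """Split tagged content into granules keyed by granule identifier.

    `marker` is a hypothetical tag format used only for illustration.
    Lines that appear before the first marker are not part of any granule
    and are dropped.
    """
    granules = {}
    current_id = None
    buffer = []
    for line in tagged_text.splitlines():
        if line.startswith(marker):
            if current_id is not None:
                granules[current_id] = "\n".join(buffer)
            current_id = line[len(marker):].strip()
            buffer = []
        elif current_id is not None:
            buffer.append(line)
    if current_id is not None:
        granules[current_id] = "\n".join(buffer)
    return granules
```

Each entry in the returned dictionary could then be written to its own file for access; a publication-specific parser would supply its own tag recognition in place of the simple prefix test used here.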
Packaging Application
The packaging application is responsible for managing the FDsys information packages following the conceptual model of the OAIS. The AIP in FDsys is stored and accessed only for preservation purposes, and a separate archival storage is allocated for the AIPs to achieve this goal. To accomplish the other mission of FDsys, content dissemination for public access, FDsys creates an additional package, the Access Content Package (ACP), for daily content management. The ACP also serves as the source for content dissemination to the public.
FDsys System Design Document – R1C
[Figure 4.1-1 diagram labels]
Content Management & Archival Subsystems: Documentum Repository (SIP & ACP; AIP); J2EE Web Application Server (Oracle App Server); Documentum (User Authentication & Authorization); FDsys Applications (Packaging Application, Content Parsing, Virus Check, Content Source, Adobe LiveCycle); LDAP; AIP Reconstruct
Access Subsystem: Access Storage (ACP Cache, Static Web Pages); J2EE Web Application Server (Oracle App Server); FAST ESP (Full Text Indexing, Public Search Services); FDsys Applications (Search Application, Content Delivery)
Other labels: Integrated Library System; Enterprise Service Bus; Specialists (browsers); Archivists (browsers); Public Users; Firewall; Publish Content; Package References
Figure 4.1-1 FDsys Application Architecture