




This paper describes the scheduling mechanisms and strategies used in the Purdue MACE operating system, a general purpose operating system that supports a high volume batch processing operation and at the same time provides modes of computation usually associated with time sharing systems. The paper deals mostly with the scheduling mechanisms and strategies used in the system. The Purdue MACE operating system is based on the MACE2 operating system which was originally designed by Mr. Greg Mansfield of the Control Data Corporation.





by V. A. ABELL, S. ROSEN and R. E. WAGNER
Purdue University Computing Center
Lafayette, Indiana
INTRODUCTION
In recent years there has been a great deal written and published about scheduling and storage management in time sharing systems. During the same period there has been a significant trend toward the development of more general purpose operating systems on large computers. Such systems support a high volume batch processing operation and at the same time provide modes of computation usually associated with time sharing systems. They are multiprogramming and multiprocessor systems that execute jobs that enter the job stream from local and remote card readers, and from local and remote on-line consoles. Some jobs are interactive during execution and some are not. Many jobs use interactive file creation, editing, and debugging processors even though they are basically batch jobs.
This paper describes some aspects of an operating system of this type that is now running at the Purdue University Computing Center on a CDC 6500 supported by an IBM 7094. The paper deals mostly with the scheduling mechanisms and strategies used in the system. These mechanisms and strategies are probably not new, since all kinds of scheduling disciplines have been proposed and discussed in the literature.^1 However, we believe that this is the first time that scheduling and job movement techniques of the type described here have been implemented and used in a very large system with the high job volume and diversity that characterize a large university computing center.
The Purdue MACE operating system
The Purdue MACE operating system is based on the MACE^2 operating system which was originally designed by Mr. Greg Mansfield of the Control Data Corporation. MACE is an outgrowth of the first operating system for the Control Data 6000 series that was developed at CDC's Chippewa Falls Laboratory.^3 The underlying design of that first system, the Chippewa Structure, has formed the basis for several of the most successful operating systems for the CDC 6000 series. These include SCOPE 2.0, SCOPE 3.0-3.4, and MACE.
The Chippewa Structure is successful, to a large degree, because it is closely integrated with the unique hardware organization of the CDC 6000 series.^4 That organization consists of one or two central processors (CPU's), and ten peripheral processors (PPU's), all of which share a large, fast central memory of 60 bit words. The CPU minor cycle time is 100 ns, while for the PPU it is one microsecond.
The peripheral processors each have a full instruction complement, including arithmetic, shift, and input/output instructions, and 4,096 12 bit words of private storage. They share access to twelve one-megacycle, 12 bit wide data channels. The PPU's are primarily designed for input/output tasks, communicating through the common central memory with the CPU, which is used mainly to perform computational tasks for executing programs.
CDC markets several variants of the 6000 design, each of the same structure, differing from the others only in CPU configuration. The 6600, the fastest system, has a CPU with parallel arithmetic units. The 6400 has a slower CPU with sequential arithmetic units. The 6500, which is the system in use at Purdue, has two 6400 CPU's. The 6700 has one 6600 and one 6400 CPU.
90 Fall Joint Computer Conference, 1970
Central memory in the Purdue 6500 system consists of 65,536 60 bit words. The memory is organized in phased banks with access time of 100 nsec and cycle time of 1 microsecond.
Central memory organization and the control point
In the Purdue MACE operating system the large central store is divided into a user portion and a central memory resident system area. The system area, which now occupies just under 11000 words, contains allocation tables, routine and file directories, a small amount of system central processor code (most of the system executes in the peripheral processors), a number of key peripheral processor routines, and a set of job control blocks, known as control points.
A control point is a pivotal area, occupying 128 words of central memory, through which job execution is controlled, and to which the resources for job execution are allocated. The control point may be thought of as the control element of an individual computer, and the entire set of control points as a division of the hardware machine into a number of separate machines, each of which can execute an independent task.
The number of control points was fixed at eight in the original Chippewa System and was retained at that number in most derivative systems. One control point is allocated to various system overhead functions (storage movement, mass storage space allocation, etc.). The remaining control points can be assigned to active jobs, including the control of input-output devices such as card readers, line printers, remote batch stations, and keyboard consoles. While the MACE system retains this control point allocation method, it provides for the optional declaration of as many as 26 control points at system load time.
A job is assigned to an active control point after it has been queued to a system mass storage device (usually a disc storage unit). The resources required for the execution of the job are allocated to the control point. These include central memory space, central processor time, peripheral processor assistance, mountable equipment (tapes, disc packs, etc.), mass storage space, and file pointers.
The resources are allocated to control points through a monitor program which runs in a dedicated peripheral processor. A second dedicated peripheral processor runs a display program, DSD, that provides operator-system communication via a twin-screen display-keyboard console.
The remaining peripheral processors are pooled for input-output and job sequencing functions. Each contains a small resident executive containing communication, overlay loading, and mass storage driver subroutines. The pool peripheral processors constitute one of the resources assigned to control points by the monitor, and execute programs which communicate with the monitor through central memory registers.
The control point area, which occupies a fixed portion of central memory, contains pointers relating to job status, and the resources assigned to the job. Included in the control point area are a 72 word buffer, used to contain the control statements supplied for the processing of the job, and a 16 word area called the exchange package.
The exchange package is used by the system monitor to control CPU allocation. A special hardware instruction, called an exchange jump, permits the monitor to interrupt a running CPU, save its register contents, and load all registers with new contents in a single operation. The exchange jump instruction, which executes in 2 microseconds, uses the read portion of the core memory cycle to obtain a word of register contents from the exchange area, and the write portion to store the previous contents of the corresponding registers of the interrupted CPU.
When a job is at a control point waiting for the CPU, the exchange package area contains the register contents that are required to start or resume processing of the job. When the monitor performs an exchange jump for that control point, the registers are loaded from the control point area, and the control point area is loaded with an exchange package that the monitor uses to return control to the system when the job is interrupted or terminated.
The rapid CPU switching capability provided by the exchange jump operation works in conjunction with a relocation and limit register in each CPU to provide an efficient method of memory allocation.
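The register swap performed by the exchange jump can be pictured with a small model. The sketch below is illustrative only: it treats the CPU state as 16 whole words (the real package packs several registers into each word), and the names `Cpu` and `exchange_jump` are hypothetical, not taken from the MACE system.

```python
from dataclasses import dataclass, field
from typing import List

PACKAGE_WORDS = 16  # the exchange package occupies 16 central memory words


@dataclass
class Cpu:
    # Simplified register image: one integer per exchange package word (assumption).
    registers: List[int] = field(default_factory=lambda: [0] * PACKAGE_WORDS)


def exchange_jump(cpu: Cpu, memory: List[int], package_addr: int) -> None:
    """Swap the CPU register image with the 16-word package held in memory."""
    for i in range(PACKAGE_WORDS):
        incoming = memory[package_addr + i]          # read half of the memory cycle
        memory[package_addr + i] = cpu.registers[i]  # write half of the same cycle
        cpu.registers[i] = incoming
```

Because the operation is a pure swap, a second exchange jump on the same package restores the interrupted state, which is what lets the monitor resume a job later from its control point area.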
The relocation registers in the CPU permit the assignment of a contiguous region of central memory to a program, which is totally isolated from any other area, and which can be moved rapidly to, from and within the user portion of central memory.
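A minimal sketch of relocation and limit checking follows, assuming the usual scheme in which every program address is checked against the limit and then offset by the relocation value. The function names and the list-based memory are hypothetical illustrations, not the hardware's actual interface.

```python
from typing import List


class AddressOutOfRange(Exception):
    """Raised when a program address falls outside its assigned region."""


def translate(address: int, relocation: int, limit: int) -> int:
    """Map a program-relative address to an absolute central memory address."""
    if not 0 <= address < limit:
        raise AddressOutOfRange(address)
    return relocation + address


def move_job(memory: List[int], relocation: int, limit: int, new_relocation: int) -> int:
    """Copy a job's contiguous region elsewhere and return its new relocation value."""
    block = memory[relocation:relocation + limit]  # snapshot, so overlapping moves are safe
    memory[new_relocation:new_relocation + limit] = block
    return new_relocation
```

Since the program only ever sees relative addresses, moving a job requires nothing more than copying the region and changing one register value.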
Limitations of the Chippewa structure
While the Chippewa Structure in its basic design permits effective multiprogramming use of the 6000 system, it includes some static elements which seriously limit system performance. The major achievement of the MACE system was the relaxation of some of these control point restrictions. This process has been carried further in the Purdue MACE system.
sequence which will free the required resources. The rollout sequence is built from the list of running jobs whose queue priorities are lower than that of the job being scheduled. Central memory space and control point availability are the two factors considered.
Rollout density is controlled by the system monitor. In the normal job scan cycle, a job marked for rollout is assigned a peripheral processor by the monitor, unless a prespecified number of rollouts are already in progress. In the Purdue MACE system, the monitor limits the number of concurrent rollouts to two.
Once the job scheduler has started a rollout sequence, rather than wait for the sequence to complete, it continues to search for lower priority jobs which can be assigned to control points without affecting the rollout sequence, or starting another sequence. When the scheduler exhausts the lists of waiting jobs, it terminates.
The scheduler is recalled periodically, at the end of each rollout step, or when some other job sequencing operation changes the state of the machine. When recalled, the scheduler builds a new snapshot of the environment, effectively "forgetting" the job which started the rollout sequence. Because the scheduler "forgets" that job, it can respond very quickly to changes in the queues. Thus, for example, if a job enters the queues with a priority higher than the one which started the rollout sequence, that job can be executed first. Or, for example, if a job outside the rollout sequence terminates before the sequence is complete, the job causing the rollout sequence can be assigned for execution as soon as the required resources become available.
On a sub-multiple of its basic period, the job scheduler executes an overlay which adjusts queue priorities. The queue priority adjustment overlay modifies the priorities of jobs in the input-rollout queues, and those of jobs in execution at control points.
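The construction of a rollout sequence described earlier in this section can be sketched as follows. This is a simplified illustration: the text states only that the sequence is drawn from running jobs of lower queue priority and that memory space and control point availability are considered. Taking the lowest-priority jobs first, and ignoring whether the freed memory is contiguous, are assumptions of the sketch, and the names `Job` and `build_rollout_sequence` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Job:
    name: str
    queue_priority: int
    field_length: int  # central memory words the job occupies or requires


def build_rollout_sequence(candidate: Job, running: List[Job],
                           free_memory: int, free_control_points: int
                           ) -> Optional[List[Job]]:
    """Return the jobs to roll out for `candidate`, or None if it cannot be placed."""
    memory, points = free_memory, free_control_points
    victims: List[Job] = []
    # Only running jobs with a lower queue priority are eligible for rollout.
    lower = sorted((j for j in running if j.queue_priority < candidate.queue_priority),
                   key=lambda j: j.queue_priority)  # lowest first (assumption)
    for job in lower:
        if memory >= candidate.field_length and points >= 1:
            break
        victims.append(job)
        memory += job.field_length
        points += 1
    if memory >= candidate.field_length and points >= 1:
        return victims
    return None
```

An empty list means the candidate can be assigned immediately; `None` means that no set of lower-priority jobs would free enough resources, so the candidate remains queued.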
The modification of priorities for queued jobs is essentially an aging operation, to ensure that jobs of equal starting priority and resource requirements proceed on a more-or-less first-in, first-out basis.
The queue priorities of jobs in execution are modified as a major tactic in queue balancing. This modification is a portion of a three level management of job queue priority, in which the queue priority of a job is set to a high value when the job enters the input queue, is dropped to a lower value after an allotment of execution time has elapsed, and is incremented each succeeding time the job reaches a control point.
When a job enters the input queue, it is assigned two queue priority values, a "first pass" and an "execution" priority, both based on its resource parameters. The first pass queue priority is based upon a user specified (but account limited) value, an input increment, and an origin increment. Currently each job receives an input increment of 6000₈ points, and an origin increment of zero for local batch, 100₈ for remote batch, 300₈ for remote teletype, and 500₈ for interactive origination. The user value ranges from zero to 24₈.
The second queue priority value is based upon job parameters and account code classifications. The job parameters include central memory requirement, central processor time requested, and the predicted output volumes. The execution queue priority value is constructed from a table of range increments for each parameter. In general, the larger the parameter the smaller the increment it will add to the execution queue priority.
When the job input file is completed, it is queued at its first pass queue priority value. The execution queue priority value is stored in the job input file. When the job reaches a control point, the execution queue priority is stored with other job description parameters in a control point area. Thus it is available to the queue priority adjustment overlay of the job scheduler.
In scanning control point jobs, the queue priority adjustment overlay is preset to consider those jobs which have accumulated a specified amount of execution time. When a job has reached that level, its first pass queue priority is replaced with the execution value. In almost all cases the result is a drop in queue priority.
Currently, the first pass queue priority is replaced by the execution priority after a job has accumulated a total of twenty five seconds of central and/or peripheral processor usage. With a large input stream volume, the modification usually results in the rollout of the job. However, in the Purdue job mix, 75 percent of all jobs complete before the modification takes place.
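As a worked illustration of the first pass value, the sketch below reads the subscript-8 figures quoted above as octal numbers and assumes that the three components are simply added; the text says only that the priority is "based upon" them. The function name, the origin labels, and the `account_limit` parameter are hypothetical.

```python
INPUT_INCREMENT = 0o6000  # input increment given to every job (octal)
ORIGIN_INCREMENT = {      # origin increments quoted in the text (octal)
    "local_batch": 0o0,
    "remote_batch": 0o100,
    "remote_teletype": 0o300,
    "interactive": 0o500,
}
MAX_USER_VALUE = 0o24     # the user value ranges from zero to 24 octal


def first_pass_priority(origin: str, user_value: int,
                        account_limit: int = MAX_USER_VALUE) -> int:
    """Combine the three components by addition (an assumption of this sketch)."""
    if not 0 <= user_value <= MAX_USER_VALUE:
        raise ValueError("user value out of range")
    capped = min(user_value, account_limit)  # "user specified (but account limited)"
    return INPUT_INCREMENT + ORIGIN_INCREMENT[origin] + capped
```

Under these assumptions an interactive job with the maximum user value would enter the queue at 6524 octal, while a local batch job with no user value would enter at 6000 octal.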
For the user, the chosen time increment permits rapid turnaround for compilation-debugging runs, and usually guarantees that a job which aborts because of compilation errors will pass through the system very rapidly.
The remaining jobs which do not complete before the queue priority modification takes place must run to completion at their execution queue priority values. Several factors combine to enhance their throughput.
The first is a dynamic storage reduction performed by the relocatable loader. This improves job throughput because compilation and loading usually require more memory space than execution and usually complete before the queue priority modification takes place. Thus the additional execution time which the job requires can often take place at the reduced field length set by the loader.
Secondly, jobs are aged by the scheduler's queue
Scheduling in General Purpose Operating System 93
priority adjustment overlay. Thus as a job remains in the queues, its priority gradually increases. Finally, each job which is scheduled to a control point receives a small, additional queue priority increment. The control point increment, which is currently set to four aging units, is designed to protect the rollin time investment. The job is given a queue priority boost in an attempt to keep it in execution for a long enough time to make its rollin time cost reasonable. Otherwise, one could easily envision a job mix in which rollin-rollout operations enter a rapid cycle, induced by the aging process.
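The three level management of queue priority can be summarized in a short model. The twenty five second threshold and the control point increment of four aging units come from the text; the size of one aging unit, and the names `QueuedJob`, `age`, `assign_to_control_point`, and `charge_time`, are assumptions made for illustration.

```python
from dataclasses import dataclass

EXECUTION_THRESHOLD_SECONDS = 25.0  # processor time before the priority drop
CONTROL_POINT_INCREMENT = 4         # boost per control point assignment, in aging units
AGING_UNIT = 1                      # priority points per aging unit (assumption)


@dataclass
class QueuedJob:
    first_pass_priority: int
    execution_priority: int
    queue_priority: int = 0
    processor_seconds: float = 0.0
    demoted: bool = False

    def __post_init__(self) -> None:
        # Level 1: the job enters the queue at its first pass priority.
        self.queue_priority = self.first_pass_priority


def age(job: QueuedJob, units: int = 1) -> None:
    """Raise the priority of a job that is waiting in the queues."""
    job.queue_priority += units * AGING_UNIT


def assign_to_control_point(job: QueuedJob) -> None:
    """Level 3: protect the rollin investment with a small boost."""
    job.queue_priority += CONTROL_POINT_INCREMENT * AGING_UNIT


def charge_time(job: QueuedJob, seconds: float) -> None:
    """Level 2: replace the priority with the execution value after the threshold."""
    job.processor_seconds += seconds
    if not job.demoted and job.processor_seconds >= EXECUTION_THRESHOLD_SECONDS:
        job.queue_priority = job.execution_priority
        job.demoted = True
```

The boost on each assignment is what keeps a freshly rolled-in job from being displaced at once by a waiting job that has merely aged past it.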
Control point and central processor utilization
The Purdue MACE system typically runs in an eleven control point configuration with one control point allocated to basic system functions as described in an earlier section. Three others are reserved for use by system input-output processors, one for the queuing (spooling) of peripheral I/O, one for remote batch terminal control, and one for PROCSY, an on-line console system. These three control points require small amounts of memory, determined by the number of active devices. They use very little central processor time, and a larger amount of peripheral processor time for the input-output operations required.
The remaining control points are used for the execution of user problems. The two central processors are cycled among active jobs on a round-robin basis. Each job at a control point which requires a CPU is allocated one for a 65 millisecond time slice. The exchange jump operation keeps the switching overhead very low. Typically it is less than 100 microseconds per transfer.
A job that issues an input-output request may retain the CPU for the full time slice and attempt to overlap its own computing with its I/O transfers. Alternatively, it may give up the central processor for the duration of the I/O transfer. A job that surrenders the CPU when it makes an I/O request is given another 65 millisecond time slice as soon as the I/O transfer is completed.
Other algorithms for the scheduling of central processors to jobs are being considered, but so far there is no evidence that the other algorithms provide any advantage over the round robin with a relatively short time slice.
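The round-robin cycling of two central processors can be sketched with a toy simulation. It makes strong simplifying assumptions: both CPUs are dispatched in lockstep rounds, jobs never wait for I/O, and switching overhead is ignored. The function name and the job labels are hypothetical.

```python
from collections import deque
from typing import Dict, List, Tuple

TIME_SLICE_MS = 65  # time slice quoted in the text
CPU_COUNT = 2       # the 6500 has two central processors


def round_robin(cpu_demand_ms: Dict[str, int]) -> List[List[Tuple[str, int]]]:
    """Return one list per round of (job, milliseconds run) pairs, one pair per CPU."""
    ready = deque(cpu_demand_ms.items())
    rounds: List[List[Tuple[str, int]]] = []
    while ready:
        # Dispatch up to one job per CPU from the head of the ready list.
        batch = [ready.popleft() for _ in range(min(CPU_COUNT, len(ready)))]
        this_round = []
        for name, remaining in batch:
            used = min(TIME_SLICE_MS, remaining)
            this_round.append((name, used))
            if remaining > used:
                ready.append((name, remaining - used))  # back to the end of the cycle
        rounds.append(this_round)
    return rounds
```

For example, `round_robin({"a": 100, "b": 30, "c": 65})` runs `a` and `b` in the first round and `c` together with the remainder of `a` in the second.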
Job mix
Since the jobs that are running in the system may vary greatly in their demands on system resources, it
is good scheduling strategy to attempt to maintain a mix of active jobs at control points that require different resources and that make full use of these resources. Ideally there should be one or two jobs whose demands on CPU time are large compared with their input-output requirements, and one or two complementary jobs which require only small bursts of CPU time, and have a great deal of I/O activity involving non-conflicting devices.
The Purdue MACE job scheduler does not now consider these job mix factors in its calculation of queue priorities, since that would require data about the job profile that is not currently available in a form in which it can be used by system routines. Some job mix factors can be introduced manually in the present system through operator type-ins that alter the queue priorities assigned by the system. A more dynamic automatic scheduling algorithm depends on the measurement and efficient encoding of job parameters relative to CPU and input/output and other resource usage on a continuing basis during the course of the execution of each job.
The effectiveness demonstrated by our current use of the priority structure suggests that it would be possible to incorporate job mix factors in the priority value. We are presently considering a priority evaluator system to be implemented as a secondary level in addition to and separate from the scheduler already described. The priority evaluator would use the job profile data, the machine environment, and scheduling constraint parameters to assign priorities which could provide an improved job mix. This type of priority evaluation could be performed at longer intervals, possibly in terms of minutes, could use the faster capabilities of the central processors, and would not affect the ability of the primary scheduler to react to rapid changes in system load.
Tapes, disc packs and permanent files
One of the major advantages of the autoroll system is the fact that it permits the handling of requests for allocation and mounting of tapes and disc packs and the queuing of requests for access to permanent files in such a way that little or no system resources are consumed by a job while it is waiting for equipment to become available or for tapes or disc packs to be mounted.
Consider a job that enters the system with a jobcard parameter that indicates that it will use magnetic tape. The jobcard indicates the maximum number of tape units that will be required in parallel, and to simplify the discussion we shall assume that this number is one. The job is scheduled to a control point based on
pages (the working set according to Denning^5,6) must be loaded. When the job finishes its time slice, the pages that it had been using are scheduled to be rolled out to the paging drum or disc. In an active system it is very likely that all of the storage that was occupied by a job will be needed by other jobs, so that by the time the interrupted job is once again scheduled into core memory none of the pages that it had been using during its preceding time slice are still in core. This situation is almost exactly the same as if the job had been rolled out in its entirety from core memory. Various strategies have been suggested for such paging systems that would roll out all active pages on completion of a time slice. A prepaging strategy would then roll the job, or at least a working set of the job, back in when it was again made active. A system of this type comes very close to an autoroll system in which the whole job is rolled out and brought back in when it is reactivated.
There are some advantages in moving a whole job rather than individual pages. These advantages arise because of the greater efficiency of writing tracks rather than individual blocks during peripheral transfers. In the particular storage system in use at Purdue, a half track consisting of 3136 60-bit words is read or written during every 50 msec disc revolution, possibly after an initial delay of 20-100 msec for seek time. It does not take much longer to move the whole job than it would take to move a few selected pages of the job.
There are of course other advantages, and possibly other disadvantages to paging systems. It is not our intention to discuss these here. Rather it is our intention to point out that autoroll systems are not necessarily inherently less efficient than paging systems.
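The disc figures quoted above allow a rough estimate of transfer times. The sketch below assumes one seek per transfer, no additional rotational latency, and that a partly filled half track still costs a full revolution; these assumptions and the function names are illustrative only.

```python
import math

HALF_TRACK_WORDS = 3136  # 60-bit words moved per disc revolution
REVOLUTION_MS = 50       # duration of one disc revolution


def sustained_words_per_second() -> float:
    """Transfer rate once the head is positioned."""
    return HALF_TRACK_WORDS * 1000 / REVOLUTION_MS


def transfer_ms(words: int, seek_ms: int) -> int:
    """Estimated time to move `words` after a single seek of `seek_ms`."""
    revolutions = math.ceil(words / HALF_TRACK_WORDS)
    return seek_ms + revolutions * REVOLUTION_MS
```

Under these assumptions the sustained rate is 62,720 words per second, and a hypothetical 20,000 word job moves in seven revolutions, or 350 msec plus the seek. Any transfer, however small, costs at least one seek and one revolution, so moving a few separately positioned pages can approach the cost of moving the whole job.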
Use of extended core storage
The efficiency of the autoroll system is enhanced in the Purdue MACE system by the use of Extended Core Storage (ECS) as a buffer for the rollout process. Extended core storage is a large core storage system designed to be used for streaming data to and from central memory. In its full configuration, with a minimum of 500,000 60-bit words, a streaming rate of 100 nsec per word can be realized. The present 125K ECS at Purdue transfers data at 400 nsec per word. The 250K configuration scheduled to be installed in the summer of 1970 will increase the streaming rate to 200 nsec/word. In the Purdue MACE system, whenever a rollout is signalled, the entire central memory field length of the job being rolled out is moved to ECS at the full
ECS streaming rate. The space that was occupied in central memory is then immediately released for use by other jobs. The contents of the field length that was streamed to ECS are then moved from ECS to a disc storage file at the same rate as they could have been moved directly from central memory to disc storage. Since the transfer rate to disc storage is about 62000 words per second, the use of ECS to buffer the rollout process makes the field length of central memory that is being freed available from several hundred milliseconds to several seconds earlier than in the unbuffered system.
The reverse process of staging input or rollin files in ECS is under consideration, but its implementation would require some major changes in job movement strategies which are now being studied and evaluated.
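The benefit of buffering rollouts in ECS can be checked with simple arithmetic. The sketch below uses the 400 nsec per word rate of the present 125K ECS and the disc rate of about 62000 words per second; it ignores seek time and queuing for the disc, and the function names are hypothetical.

```python
ECS_NS_PER_WORD = 400           # streaming rate of the present 125K ECS
DISC_WORDS_PER_SECOND = 62_000  # approximate disc transfer rate


def ecs_seconds(words: int) -> float:
    """Time to stream a field length from central memory to ECS."""
    return words * ECS_NS_PER_WORD * 1e-9


def disc_seconds(words: int) -> float:
    """Time to write the same field length directly to disc, ignoring seeks."""
    return words / DISC_WORDS_PER_SECOND


def memory_freed_earlier_by(words: int) -> float:
    """How much sooner central memory is released when ECS buffers the rollout."""
    return disc_seconds(words) - ecs_seconds(words)
```

For a hypothetical 20,000 word field length, the ECS transfer takes about 8 msec against roughly 320 msec for the disc, so the memory is released about 0.3 seconds sooner. Seek delays and waits for the disc, which this sketch leaves out, would add to that saving.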
Performance
Every job that goes through the system causes a sequence of messages to be written in a system file called DAYFILE. These messages tell when the job entered the input queue, how long it took in compilation and loading, how many times it was rolled out, what error conditions were encountered, when the job entered the print queue, etc. The DAYFILE data is used for billing purposes, and also serves as a data base for a number of programs designed to present a picture of the performance of the system. Some of the details of the programs and techniques used will be presented in another report.
The Purdue MACE operating system was phased into operation during the summer of 1969 and took over as the only production system by the end of August. In the first full month of operation, September, 1969, a total of just over 25,000 jobs were run. Of these about 9000 were remote console jobs submitted by way of the newly introduced PROCSY system. By October of 1969 the total number of jobs was over 60,000 of which about 25,000 were PROCSY jobs. By February of 1970, the last month for which statistics are available at the time this is being written, the total number of jobs run in the relatively short month had risen to 80,000, of which about 45,000 were PROCSY jobs.
The system has been able to absorb this very large increase in console-submitted jobs without seriously affecting its ability to handle batch jobs. The console system is now almost exclusively a remote job entry system. During the next few months we expect a very large volume of interactive computing to be added as the new interactive text editor and a new interactive algebraic language processor come into full production. Some hardware and system software
changes are being made to accommodate this increased load, but it will be essentially the same system with the same scheduling and job movement mechanisms.
On a typical busy day there are now in excess of 10,000 rollouts. Short jobs, whether entered through card readers or through typewriter consoles, get very good turnaround. Longer jobs are mostly relegated to the rollout queues during the main shift in which input activity is very heavy. Most of them are completed during the late night shift when the console system is turned off.
There is of course a very substantial amount of system overhead associated with rolling out and rolling back in over 10,000 jobs per day. This overhead does not seem to be too high a price to pay for the ability to handle interactive jobs, and the ability to implement scheduling strategies like those that give fast turnaround to short debugging runs. In addition, statistics gathered before and after the introduction of the autoroll system show that the Purdue MACE system is more efficient in its CPU
utilization and in its central memory utilization than its predecessor systems.
REFERENCES
1 E G COFFMAN JR, L KLEINROCK, Computer scheduling methods and counter-measures, AFIPS Conference Proceedings Vol 32 Spring 1968 p 11-
2 Control Data MACE operating system preliminary reference manual, Control Data Corporation Publication No 44613900
3 Control Data Chippewa operating system reference manual, Control Data Corporation Publication No 60134400
4 Control Data 6400/6500/6600 computer systems reference manual, Control Data Corporation Publication No 60100000
5 P J DENNING, The working set model for program behavior, Comm of the ACM 11 5 May 1968 p 323-
6 P J DENNING, Virtual memory, Technical Report Number 81, Computer Science Laboratory, Princeton University, January 1970