Exploring Condor: A Parallel Computing System for Harvesting Idle Cycles - Prof. Jeffrey K, Study notes of Computer Science

An overview of Condor, a parallel computing system designed to exploit idle cycles on workstations and dedicated clusters. The system manages both resources and resource requests, using mechanisms such as ClassAd matchmaking, process checkpoint/restart, and remote system calls. The notes also cover scheduling, transparency, Condor's Standard Universe, and process checkpointing.

Typology: Study notes

Pre 2010

Uploaded on 02/13/2009

CMSC 714 –F06 (lect OS) copyright 2003 Jeffrey K. Hollingsworth
Computing Environment

• Cost Effective High Performance Computing
  – Dedicated servers are expensive
  – Non-dedicated machines are useful
    • high processing power (~1GHz), fast network (100Mbps+)
    • long idle time (~50%), low resource usage

[Figure labels: Machines in office ("Need cycles to run my simulations"), Computer Lab, Supercomputer, Clustered server, W/S's and PC's, Network]

OS Support For Parallel Computing

• Many applications need raw compute power
  – Computer H/W and S/W Simulations
  – Scientific/Engineering Computation
  – Data Mining, Optimization problems
• Goal
  – Exploit computation cycles on idle workstations
• Projects
  – Condor
  – Linger-Longer

What Is Condor?

• Condor
  – Exploits computation cycles in collections of
    • workstations
    • dedicated clusters
  – Manages both
    • resources (machines)
    • resource requests (jobs)
  – Has several mechanisms
    • ClassAd Matchmaking
    • Process checkpoint / restart / migration
    • Remote System Calls
    • Grid Awareness
  – Scalable to thousands of jobs / machines
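The slide names ClassAd matchmaking without showing how a match is decided. The following is a minimal sketch of the idea only, assuming that each side (job and machine) advertises attributes, states requirements about the other side, and may rank acceptable partners. The `Party` class, the `match` function, and all attribute names are hypothetical and do not reproduce the real ClassAd language or Condor's negotiator.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Party:
    """One side of a match: its attributes plus what it demands of the other side."""
    attrs: dict
    requirements: Callable[[dict], bool]                 # must hold for the other side
    rank: Callable[[dict], float] = lambda other: 0.0    # higher means more preferred


def match(job: Party, machines: List[Party]) -> Optional[Party]:
    """Return the machine the job ranks highest among mutually acceptable ones."""
    candidates = [
        m for m in machines
        if job.requirements(m.attrs) and m.requirements(job.attrs)
    ]
    return max(candidates, key=lambda m: job.rank(m.attrs), default=None)


# Hypothetical machine ads: ws1 is too small, ws3 is busy.
machines = [
    Party({"name": "ws1", "memory_mb": 256, "idle": True},
          requirements=lambda job: job["owner"] != "guest"),
    Party({"name": "ws2", "memory_mb": 1024, "idle": True},
          requirements=lambda job: True),
    Party({"name": "ws3", "memory_mb": 2048, "idle": False},
          requirements=lambda job: True),
]

# Hypothetical job ad: needs an idle machine with enough memory, prefers more memory.
job = Party({"owner": "alice", "image_mb": 300},
            requirements=lambda m: m["idle"] and m["memory_mb"] >= 300,
            rank=lambda m: m["memory_mb"])

chosen = match(job, machines)
print(chosen.attrs["name"] if chosen else "no match")
```

The point of the sketch is that the match is two-sided: a machine owner can refuse jobs just as a job can refuse machines, which is what lets non-dedicated workstations take part on their owners' terms.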

Condor – Dedicated Resources

• Dedicated Resources
  – Compute Clusters
• Manage
  – Node monitoring, scheduling
  – Job launch, monitor & cleanup

Mechanisms in Condor

• Transparent Process Checkpoint / Restart
• Transparent Process Migration
• Transparent Redirection of I/O
  – Condor’s Remote System Calls

CondorView Usage Graph

Some Challenges

• Condor does whatever it takes to run your jobs, even if some machines…
  – Crash (or are disconnected)
  – Run out of disk space
  – Don’t have your software installed
  – Are frequently needed by others
  – Are far away & managed by someone else

Condor’s Standard Universe

• Condor can support various combinations of features/environments
  – In different “Universes”
• Different Universes provide different functionality
  – Vanilla
    • Run any Serial Job
  – Scheduler
    • Plug in a meta-scheduler
  – Standard
    • Support for transparent process checkpoint and restart

When Will Condor Checkpoint Your Job?

• Periodically, if desired
  – For fault tolerance
• To free the machine to do a higher priority task (higher priority job, or a job from a user with higher priority)
  – Preemptive-resume scheduling
• When you explicitly run
  – condor_checkpoint
  – condor_vacate
  – condor_off
  – condor_restart
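To make the periodic and preemption cases above concrete, here is a toy sketch of checkpoint and resume. It is deliberately not transparent: the job saves its own state with `pickle`, whereas the Standard Universe checkpoints the process without source changes. The names `run_job`, `save`, `ckpt_every`, and `stop_after` are all hypothetical, and `stop_after` merely simulates the machine being reclaimed.

```python
import os
import pickle
import tempfile


def save(state, path):
    """Write the checkpoint to a temp file, then rename, so a crash never leaves half a file."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def run_job(total_steps, ckpt_path, ckpt_every=100, stop_after=None):
    """Sum 0..total_steps-1, checkpointing periodically; resume if a checkpoint exists.

    stop_after simulates eviction: return None after that many steps in this run.
    """
    if os.path.exists(ckpt_path):
        with open(ckpt_path, "rb") as f:
            state = pickle.load(f)          # restart from the last checkpoint
    else:
        state = {"step": 0, "total": 0}     # fresh start

    done_here = 0
    while state["step"] < total_steps:
        if stop_after is not None and done_here >= stop_after:
            return None                     # evicted; work since last checkpoint is lost
        state["total"] += state["step"]
        state["step"] += 1
        done_here += 1
        if state["step"] % ckpt_every == 0:
            save(state, ckpt_path)          # periodic checkpoint
    return state["total"]


with tempfile.TemporaryDirectory() as d:
    ckpt = os.path.join(d, "job.ckpt")
    first = run_job(1000, ckpt, stop_after=250)   # evicted after 250 steps
    second = run_job(1000, ckpt)                  # resumes from step 200, not step 0

print(first, second)
```

In this run the eviction happens 50 steps after the last checkpoint, so those 50 steps are redone on restart. That is the trade-off behind the checkpoint interval: shorter intervals lose less work but spend more time saving state.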

Condor Daemon Layout

[Figure: daemon layout for a Personal Condor / Central Manager. Labels: master, collector, negotiator, schedd, startd. Legend: Process Spawned.]

Access to Data in Condor

• Use Shared Filesystem if available
• No shared filesystem?
  – Remote System Calls (in the Standard Universe)
  – Condor File Transfer Service
    • Can automatically send back changed files
    • Atomic transfer of multiple files
  – Remote I/O Proxy Socket
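The slide says the file transfer service can automatically send back changed files but not how changes are detected. One plausible way, shown here purely as an assumption, is to snapshot the job's working directory before and after the run and compare content digests. The functions `snapshot` and `changed_files` and the file names are hypothetical; the atomic multi-file transfer itself is not modeled.

```python
import hashlib
import os
import tempfile


def snapshot(directory):
    """Map each file name in the directory to a SHA-256 digest of its contents."""
    digests = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path, "rb") as f:
                digests[name] = hashlib.sha256(f.read()).hexdigest()
    return digests


def changed_files(before, after):
    """Files that are new, or whose contents differ from the earlier snapshot."""
    return sorted(name for name, digest in after.items()
                  if before.get(name) != digest)


with tempfile.TemporaryDirectory() as sandbox:
    def put(name, data, mode="w"):
        with open(os.path.join(sandbox, name), mode) as f:
            f.write(data)

    # Files transferred to the execute machine before the job starts.
    put("input.dat", "1 2 3\n")
    put("state.log", "started\n")
    before = snapshot(sandbox)

    # Stand-in for the job: creates one file and appends to another.
    put("output.dat", "6\n")
    put("state.log", "finished\n", mode="a")
    after = snapshot(sandbox)

    to_send_back = changed_files(before, after)

print(to_send_back)
```

Only `output.dat` and `state.log` are selected; the untouched `input.dat` stays behind, which is the saving the slide's first sub-bullet points at.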

Standard Universe Remote System Calls

• I/O System calls trapped
  – Sent back to submit machine
• Allows Transparent Migration Across Domains
  – Checkpoint on machine A, restart on B
• No Source Code changes required
• Language Independent
• Opportunities
  – For Application Steering
    • Condor tells customer process “how” to open files
  – For compression on the fly
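The trapping idea above can be illustrated with a small in-process simulation. This is a sketch under loud assumptions: the real mechanism intercepts system calls in a library linked with the job, while here the job is simply handed a file-like object that forwards every operation. The classes `Shadow` and `RemoteFile`, the `handle` method, and the file names are hypothetical, and no network is involved.

```python
class Shadow:
    """Stands in for the submit-side process that performs I/O on the job's behalf."""

    def __init__(self, files):
        self.files = dict(files)   # path -> bytes; plays the role of the home file system
        self.log = []              # names of the forwarded calls, in order

    def handle(self, call, *args):
        self.log.append(call)
        return getattr(self, "_" + call)(*args)

    def _read(self, path, offset, size):
        return self.files[path][offset:offset + size]

    def _write(self, path, data):
        self.files[path] = self.files.get(path, b"") + data
        return len(data)


class RemoteFile:
    """Execute-side handle: holds no data, only a path and a read offset."""

    def __init__(self, shadow, path):
        self.shadow, self.path, self.offset = shadow, path, 0

    def read(self, size):
        data = self.shadow.handle("read", self.path, self.offset, size)
        self.offset += len(data)
        return data

    def write(self, data):
        return self.shadow.handle("write", self.path, data)


def job(open_file):
    """The job itself is ordinary read/write code; it does not know its I/O is remote."""
    src = open_file("input.txt")
    dst = open_file("output.txt")
    while True:
        chunk = src.read(4)
        if not chunk:
            break
        dst.write(chunk.upper())


shadow = Shadow({"input.txt": b"idle cycles"})
job(lambda path: RemoteFile(shadow, path))
print(shadow.files["output.txt"])
```

Because all file data lives on the submit side and the execute side keeps only a path and an offset, nothing about the files ties the job to the machine it happens to run on, which is what makes "checkpoint on machine A, restart on B" workable.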

[Figure: remote I/O between the Execution Site and the Submission Site. Labels: Job, Fork, starter, shadow, Home File System, I/O Library, I/O Server, I/O Proxy, Secure Remote I/O, Local System Calls, Local I/O (Chirp).]

[See Figure 1]

[Figure: Condor-G job flow between the Job Submission Machine and the Job Execution Site. Labels: End User Requests, Persistent Job Queue, Condor-G Scheduler, Condor-G GridManager, GASS Server, Condor-G Collector, Condor Shadow Process for Job X, Globus Daemons, Local Site Scheduler, Condor Daemons, Job, Job X, Condor System Call Trapping & Checkpoint Library, Fork, Resource Information, Transfer Job X, Redirected System Call Data.]