
 

Introductions

An Introduction to Condor

Dr Simon Hood and Dr Jonathan Boyle

Research Computing Services

rcs@manchester.ac.uk






 

What is RCS?

Research Computing Services Support?




 

Research Computing Services, RCS

  • Specialist part of IT Services.

Contact Details
What is Research Computing?
  • Computing to support research! Examples:
    • running complex simulations;
    • performing vast parameter searches.




 

Research Computing Examples

High Throughput Computing (HTC)
  • Large amounts of computing power over a "long" time:
    • Running long jobs!
    • Running the same experiment many (1000s) times, with different inputs.
High Performance Computing (HPC)
  • Large amounts of computing power over a "short" time:
    • many CPUs simultaneously to run complex models quicker;
    • many compute nodes' RAM simultaneously to handle very big jobs.
Data Analysis and Visualization
  • Getting the information out of the vast quantities of data.




 

How Can RCS Help You? Free Stuff

Free Stuff!

Provision of resources:
  • Horace, Man1, Man2, Mace01, Redqueen. . .
  • Condor pools; NW-Grid, NGS.
Administration of HPC/HTC Clusters:
  • Administer and support University, NW-Grid, NGS and some school and research group HPC clusters.
Support and Training:
  • Documentation — Web and Wiki.
  • Courses!
  • Usage of HPC/HTC (inc. Condor) clusters.
  • Application support.




 

How Can RCS Help You? In-Depth Support

In-depth support and collaborations

Free dedicated short-term help
  • Advice on parallelisation of code, or
  • advanced use of HTC (inc. Condor).
More in-depth help and collaborations
  • Optimising code/models: scoping, estimate, coding — dedicated resources may require funding.
  • Example: one year's dedicated effort extracting maximum performance. Named resource/researchers on RCUK/EU etc. grants.




 

Other Related Courses

Introduction to Condor:
  • CPU-cycle scavenging and HPC cluster backfill;
  • Web pages;
Introduction to LaTeX:
Other Courses:
  • Introduction to OpenMP
  • Introduction to MPI
  • Fortran 95
  • Matlab
  • Image-Based Modelling


details and on-line booking. . .




 

This Course

Today's Course

  • Three speakers
    • Simon Hood (RCS) 10:00 – 12:00 approx.;
    • Jonathan Boyle (RCS) and Ian Cottam (EPS)




 

This Course: Part One

Simon (AM)

  • what Condor is;
  • how to use it — simple cases;
  • what Condor is good at and what it's not;
  • what EPS- and RCS-backed Condor facilities are available to you.

. . .cont'd. . .




 

This Course: Parts Two and Three

Jonathan (PM)

  • Using Matlab with Condor
  • Job control using Dagman
  • Job control and monitoring with BASH scripts

Ian Cottam

  • Condor and Dropbox




 

What is Condor?

From Wikipedia

  • Condor is a high-throughput computing software framework for coarse-grained distributed parallelization of computationally-intensive tasks.
    • ?




 

What is Condor, in English?

Cycle-scavenging
It can farm out computational work to idle desktop computers.
Runs on everything
Linux, Unix, Mac OS X, FreeBSD, and (even) MS Windows.
It can work as a traditional batch system
It can manage workload (jobs) on a dedicated cluster of computers (Beowulf) in place of SGE/LSF/PBS. . .
Glue
Can seamlessly integrate dedicated and other resources, e.g., Beowulfs, and (idle) teaching clusters and/or office desktop machines.
All types of jobs
Can schedule serial and parallel jobs.
Backfill
On traditional HPC clusters. . .




 

Condor Philosophy

[From www.cs.wisc.edu]




 

What is it good for?

Condor is Complementary to Traditional Batch Systems

  • Good for backfill and using "spare" CPU cycles.
  • Therefore, good for running jobs that can fill gaps flexibly.
  • So, jobs which individually do not require great resources, e.g., RAM or diskspace:
    • can run "anywhere";
    • can be checkpointed and migrated easily — requires re-linking.
  • Large numbers of small jobs, e.g., parameter sweeps, are ideal.
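
To illustrate why parameter sweeps suit Condor, here is a minimal Python sketch that expands a small grid of parameters into one argument line per job. The parameter names and values are hypothetical, as is the idea that the program takes `--name=value` options. The point is only that no job depends on another, so each can run wherever a machine happens to be idle.

```python
from itertools import product

def sweep_arguments(grid):
    """Return one argument string per combination of parameter values."""
    names = sorted(grid)
    combos = product(*(grid[name] for name in names))
    return [" ".join(f"--{n}={v}" for n, v in zip(names, combo))
            for combo in combos]

# Hypothetical parameters for an imaginary simulation.
grid = {"temperature": [280, 300, 320], "pressure": [1.0, 2.0]}
jobs = sweep_arguments(grid)

print(len(jobs), "independent jobs")
for line in jobs:
    print(line)
```

Each printed line could become the arguments of one small Condor job.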




Sometimes better to use Condor, sometimes SGE. . .




 

Traditional Condor Pools

CPU-scavenging

  • Use otherwise wasted compute cycles from non-dedicated resources:
    • individuals' office desktops;
    • teaching/public clusters.
  • Converts unused desktops into a distributed high-throughput computing (HTC) facility.
  • Minimal effect on desktop users:
    • Condor jobs start only after zero keyboard/mouse input for, say, 15 minutes;
    • within seconds of keyboard/mouse input, Condor jobs are suspended.
  • All machines in the pool can submit jobs; all will likely run jobs; symmetrical, peer-to-peer topology.
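
The start/suspend behaviour above can be pictured with a toy model in Python. This is only a sketch of the idea, not how Condor is implemented or configured: the function name and return values are made up, and the 15-minute threshold is simply the "say, 15 minutes" figure from the slide.

```python
IDLE_BEFORE_START_S = 15 * 60  # illustrative threshold; real pools set their own

def next_action(idle_seconds, job_running):
    """Toy policy: decide what should happen on a desktop machine.

    idle_seconds is the time since the last keyboard/mouse input.
    """
    if job_running:
        # Fresh input resets the idle time, so a short idle time while a
        # job is running means the owner has come back.
        return "suspend" if idle_seconds < IDLE_BEFORE_START_S else "keep running"
    return "start" if idle_seconds >= IDLE_BEFORE_START_S else "wait"

for idle, running in [(60, False), (1000, False), (2000, True), (2, True)]:
    print(idle, running, "->", next_action(idle, running))
```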




 

Features of Condor

  • Condor machines are members of a pool.
  • Members can be compute nodes, submit nodes, or both — traditionally both.
  • Each pool has exactly one "head node" — the collector/negotiator.
  • Condor manages both resources (machines) and resource requests (jobs)
  • Transparent checkpoint/restart
    • and process migration (for some jobs)
  • Manages large numbers of (small) jobs well.




 

Using Condor: Overview

How do I get computation done with Condor?

  • Ensure your job is batch-ready — requires no user input, no GUI — just as for SGE/LSF/PBS. . .
  • Choose a universe — much more later.
  • Create a small text file which defines the Condor job (cf. qsub script).
  • Submit the job!
  • Monitor progress: output, error and log files.
  • Sit back with a nice mug of tea and enjoy the free CPU cycles.
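
As a rough idea of what the "small text file" looks like, the Python sketch below builds a minimal submit description for a single vanilla-universe job. The executable name, arguments and file names are hypothetical; the keywords (`universe`, `executable`, `arguments`, `output`, `error`, `log`, `queue`) are standard submit-file commands.

```python
def make_submit_description(executable, arguments, tag):
    """Return the text of a minimal Condor submit description file."""
    lines = [
        "universe   = vanilla",
        f"executable = {executable}",
        f"arguments  = {arguments}",
        f"output     = {tag}.out",   # the job's standard output
        f"error      = {tag}.err",   # the job's standard error
        f"log        = {tag}.log",   # Condor's own record of job events
        "queue",
    ]
    return "\n".join(lines) + "\n"

# Hypothetical program and arguments.
text = make_submit_description("my_sim", "--steps 1000", "run1")
print(text)
```

Saving the printed text to a file such as `run1.sub` gives something that could be passed to `condor_submit`; the three named files are the ones to watch while the job runs.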




 

So let's see it!

Demo holding page. . .




 

Command Summary

condor_status
Display status of pool: number and type of machines; status of machines — owner/busy/idle; more. . .
condor_submit
Queue jobs for execution under Condor.
condor_q [-global]
Display information about jobs in the Condor job queue; defaults to the local queue.
condor_rm
Remove jobs from the Condor queue.
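
A typical session strings these four commands together. The Python sketch below only builds and prints the command lines, so it runs without Condor installed. The helper names, the submit file name and the job ID are hypothetical.

```python
import shlex

def submit_cmd(submit_file):
    return ["condor_submit", submit_file]

def queue_cmd(all_queues=False):
    # -global asks about all queues rather than just the local one.
    return ["condor_q"] + (["-global"] if all_queues else [])

def remove_cmd(job_id):
    return ["condor_rm", str(job_id)]

session = [
    ["condor_status"],       # what is in the pool, and is it busy?
    submit_cmd("run1.sub"),  # queue the job
    queue_cmd(),             # check the local queue
    queue_cmd(True),         # check every queue
    remove_cmd(42),          # give up on job 42 (made-up ID)
]
for argv in session:
    print(" ".join(shlex.quote(arg) for arg in argv))
```

On a machine with Condor installed, each list could be handed to `subprocess.run` instead of being printed.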




 

Command Help

condor_<command> -h|-help
    # ...lists all command-line args... 




 

Running a Job: Overview

In this module we look at the complete job cycle:

  • Make it batch-ready
  • Choose a Universe
  • Create a submit file
  • Submit the job
  • Monitor your job's status




 

Universes and Job Examples

In this module. . .

  • detail the most commonly used universes in Condor
    • Vanilla, Standard. . .
  • give example Condor submission scripts for each.




 

Data and File Transfer Summary

In this module. . .

  • Summarise use of remote IO and shared filesystems in Condor.
  • Outline how to explicitly transfer required input and output files.
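
For pools without a shared filesystem, explicit transfer is requested in the submit file. As a hedged sketch, the Python helper below produces the usual lines. The helper name and file names are hypothetical; `should_transfer_files`, `when_to_transfer_output` and `transfer_input_files` are standard submit-file commands.

```python
def transfer_settings(input_files):
    """Return submit-file lines asking Condor to copy files to and from the job."""
    lines = [
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",  # fetch results when the job ends
    ]
    if input_files:
        lines.append("transfer_input_files = " + ", ".join(input_files))
    return lines

# Hypothetical input files for an imaginary job.
for line in transfer_settings(["params.txt", "mesh.dat"]):
    print(line)
```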




 

Class Ads

In this module. . .

  • What class ads are
    • workstation resource ads
    • job ads
  • Class ad matching
  • Debugging via -better-analyze
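
Class ad matching is two-sided: a job's requirements are checked against the machine, and the machine's against the job. The toy Python model below shows that idea only. Real class ads are written in Condor's own expression language, not Python, and the attribute values here are invented for illustration.

```python
# Toy "ads": plain dictionaries whose Requirements entry is a function
# that inspects the other party's ad.
machine_ad = {
    "OpSys": "LINUX",
    "Memory": 2048,
    "Requirements": lambda job: job["ImageSize"] <= 1024,
}
job_ad = {
    "Owner": "alice",  # hypothetical user
    "ImageSize": 512,
    "Requirements": lambda m: m["OpSys"] == "LINUX" and m["Memory"] >= 1024,
}

def ads_match(job, machine):
    """A match needs both sides' requirements to be satisfied."""
    return job["Requirements"](machine) and machine["Requirements"](job)

windows_ad = dict(machine_ad, OpSys="WINDOWS")
print(ads_match(job_ad, machine_ad))  # job and Linux machine accept each other
print(ads_match(job_ad, windows_ad))  # the job's own requirement rules this out
```

When a job will not match anything, asking which requirement is failing is the kind of question `-better-analyze` is there to answer.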




 

RCS Pools, Backfilling Dedicated HPC Systems

In this section we outline:

  • how Condor can "backfill" traditional HPC clusters;
  • Condor facilities offered by RCS.




 

Condor and Grid Computing

In this module:

  • we define what we mean by grid;
  • outline (only) how Condor can help with grid computing.




 

Installing Condor

In this module we outline:

  • where to get the software from;
  • how to set up a Linux machine to join a Condor pool;
  • and how to set up a Condor pool from scratch.




 

Networking, Topology and Firewalls

In this module:

  • [Placeholder]




 

Condor and Matlab

  http://www.liv.ac.uk/e-science/condor/matlab/
Or simply use nodes with a shared filesystem?
