Downtimes - MSCF HPC Planned Outages

MSCF Power Outage - Impact Opus
12/27/2002 - Systems in the MSCF will be down next Friday starting at 3PM (1/3/2003). The systems will return to full operational status by 8AM on Monday morning (1/6/2003). The outage is to perform facility modifications to the MSCF in preparation for the second phase of the HP supercomputer. We will not have the normal Thursday outage on Opus on 1/2/2003. For additional information contact R. Scott Studham (scott.studham@pnl.gov).
MSCF Power Outage - Impact Colony
12/27/2002 - Colony will be down next Friday starting at 3PM (1/3/2003). Colony will return to full operational status by 8AM on Monday morning (1/6/2003). The outage is to perform facility modifications to the MSCF in preparation for the second phase of the HP supercomputer. For additional information contact R. Scott Studham (scott.studham@pnl.gov).
Opus Scheduled Downtime and Upgrades
12/17/2002 - We have a standing downtime on Opus every Thursday from 7AM until 2PM. For additional information contact R. Scott Studham (scott.studham@pnl.gov).
MSCF Computers Estimated to Be Operational April 23, 2002
02/26/2002 - The facility modification took longer than expected, delaying the restart of the MSCF computer systems. We are targeting 8:00 AM PST Tuesday April 23 for the MSCF computer systems to be fully operational. For additional information contact Julia C. White (Julia.White@pnl.gov).
EMSL Services Outage April 19-21, 2002
02/26/2002 - Important News for Users of the Molecular Science Computing Facility. The outage on April 19 to April 21 will impact ALL of the MSCF users. No jobs will be started on NWMpp1, Jupiter, Colony, and ECS1 after 2:00 PM PST on Friday April 19. Any jobs still running at that time will be terminated. SDM will also be unavailable during this outage. Systems are estimated to be fully operational after 12:00 PM PST on Monday April 22. Further details will be posted as they become available. For additional information contact Julia C. White (Julia.White@pnl.gov).
Colony Outage
03/22/2002 - Due to problems with Giganet drivers and large jobs, we are going to take Colony down for a couple of days. We plan to reinstall the old version of the OS and driver software to verify that large jobs still run and that we don't have bad hardware. Assuming that all goes well, we should be back up next week. For additional information contact Gary Skouson (Gary.Skouson@pnl.gov).
Colony Down for Upgrade February 26, 8:00am through February 28, 12:00pm
02/22/2002 - We plan to upgrade the Colony systems next week starting at 8:00am on Tuesday, February 26. We plan to have the systems back online for use at noon on Thursday, February 28. We will upgrade to RedHat 7.2 with Linux kernel 2.4. We have set a system reservation to keep jobs that would run into the outage time from starting. If you submit jobs that won't complete by the outage start time, they should remain in the queue and be started once the outage is over. For additional information regarding this scheduled downtime contact Gary Skouson (gary.skouson@pnl.gov).
Scientific Data Management (SDM) Software Outage on Tuesday, January 22
01/17/2002 - Reason for Outage: Upgrade Avatar's operating system. This should not affect FTP users. If you have questions about SDM or would like assistance, please feel free to contact Julia C. White (Julia.White@pnl.gov).
NWMpp1 and supporting service outage from 8:00 AM PST to 6:00 PM PST, on Saturday, December 1
11/20/01 - MSCF Operations has scheduled an outage for NWMpp1 and supporting servers on Saturday December 1, 2001 from 8:00am until 6:00pm. NWEcs1 and Jupiter will not be able to run jobs as they depend on the same qbank, u1 and u2 home directory servers we are updating. Tasks we expect to accomplish during this interruption are:
  1. Reboot NWMpp1; it will have been up almost 300 days straight by then.
  2. Install GigE adapter cards in the home directory servers to provide faster access to u1 and u2 for users and backup operations.
  3. Reboot home directory servers and qbank; they have been up over 500 days.
  4. Re-cable the home directory disk drives to enable increasing the size of u2 and reduce complexity.

Any job that would run past the start of the outage will not be allowed to start by the scheduler. We will see a decrease in utilization approaching Saturday morning; this is normal. Any job in the queue will remain in the job queue throughout the outage and will start automatically after the scheduler is enabled.
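
The start rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the MSCF scheduler's actual code: the function name can_start and its parameters are hypothetical, and the dates are the December 1, 2001 outage window from this notice.

```python
from datetime import datetime, timedelta

def can_start(now, wallclock_limit, outage_start, outage_end):
    """Hypothetical check: may a job start at `now` without overlapping the outage?

    A job is allowed to start only if its requested wallclock limit ends
    at or before the outage begins, or if the outage is already over.
    """
    if now >= outage_end:
        # Outage is finished; queued jobs may start again.
        return True
    # Before or during the outage: the whole run must fit ahead of it.
    return now + wallclock_limit <= outage_start

# Outage window taken from the notice: Saturday December 1, 2001, 8:00am to 6:00pm.
outage_start = datetime(2001, 12, 1, 8, 0)
outage_end = datetime(2001, 12, 1, 18, 0)

friday_evening = datetime(2001, 11, 30, 20, 0)
short_job = can_start(friday_evening, timedelta(hours=10), outage_start, outage_end)
long_job = can_start(friday_evening, timedelta(hours=14), outage_start, outage_end)
after_outage = can_start(outage_end, timedelta(hours=14), outage_start, outage_end)
```

Under these assumptions, a 10-hour job submitted Friday at 8:00pm may start, a 14-hour job stays in the queue, and the same 14-hour job becomes eligible once the outage window has passed.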

Colony outage from 8:00 AM PST to 12:00 PM PST, on May 23
5/18/01 - We need to schedule a Colony system outage to perform updates to the cluster. These updates include: adding access to the /u1 and /u2 filesystems that are available from the other MSCF systems, and updating network drivers on the fileserver nodes. This will require that the login nodes be rebooted. We plan to have the system down on Wednesday, May 23, from 8:00am - 12:00noon. Gary Skouson, mscf-support@emsl.pnl.gov
Jupiter outage from 8:00 AM PST to 11:00 AM PST, on April 26
4/20/01 - Jupiter will be down from 8:00 am to 11:00 am Thursday, April 26. We will be performing preventative maintenance on the system during that time. Among other things, this should fix the problem where signal 15's do not terminate running jobs. I have set a reservation on the system for the duration of the outage. Any jobs that would start during that time will be delayed until after the outage. If you have any questions, please feel free to contact me. - Dave Cowley, Jupiter Lead System Administrator.
NWMpp1 unavailable for 24 hours, on February 13
1/23/01 - Starting at 7AM PST on 2/13/01, MPP1 will be unavailable for use; the outage will last 24 hours. The system has been up since October, and this downtime will allow us to fix some of the preventative maintenance issues and reboot the system. If you have any feedback on issues that need to be resolved on MPP1, please do not hesitate to call or send me an e-mail. - R. Scott Studham, MSCF Computer Operations TGL. Phone: 509-376-8430
NWTest scheduled to return to service February 14
1/24/01 - NWTest has been undergoing testing in preparation for the upgrades to NWMpp1. NWTest is scheduled to return to service on February 14. - mscf-support@emsl.pnl.gov
NWEcs1 unavailable, starting January 29
1/23/01 - Monday morning, we'll shut ECS1 down for hardware and software installation. We'll probably be down about a week for the installation. At that point some initial users should get access and we'll iron out any problems that come up. It should be up and stable a couple of weeks from Monday when it goes down. Tentatively scheduled to resume operation by February 12. - mscf-support@emsl.pnl.gov
Colony outage on Wednesday, January 31
1/23/01 - We are currently planning an outage on Colony for 1/31/2001. During this outage we will:
  1. Install new hardware in the ethernet switches.
  2. Upgrade the software on the ethernet switches.
  3. Put a recompiled kernel on the pvfs and serial console servers.
  4. Add king01 (the third serial console server) into pvfs.

We are also considering adding another couple items to that list. Colony will be down on 1/31/2001. The system should be available by the end of the day, with the possible exception of pvfs that may not be available until the end of 2/1/2001. At the end of 1/31/2001 I'll send a note on the status of Colony. - Ryan Braby, MSCF Computer Operations
NWTest will be unavailable until further notice, starting December 18
12/13/00 - In preparation for the MPP1 software upgrade the MSCF Computer Operations group will be testing upgrade procedures on NWTest. Effective 12/18/00, and until further notice, NWTest will not be available for user jobs. This will not impact MPP1. - R. Scott Studham, MSCF Computer Operations TGL. Phone: 509-376-8430
Jupiter will be unavailable for use, starting 1/10/01
01/02/01 - We intend to upgrade Jupiter and migrate it to a more homogeneous configuration with the other MSCF managed resources. This upgrade will allow us to better manage the system and make 64-bit LAPI/MPI available. Starting 1/10/01 Jupiter will be unavailable for use. The upgrade normally takes 2 weeks. However, we are attempting to expedite it and return Jupiter to service on 1/17/01. The system should be stable by 1/24/01. - R. Scott Studham, MSCF Computer Operations TGL. Phone: 509-376-8430
SYSTEM Outage on NWMpp1, NWTest, NWEcs1 for Home Directory Expansion on Tuesday, October 17, 6 PM PST - 12 AM PST
10/06/00 - We are doubling the capacity of the home directories /u1 and /u2. We will piggyback our outage with the EMSL Network Outage on 10/17/2000 at 6:00PM. Any job scheduled to run past that time will not start. Any job left in the queue will automatically start that night when we restore the system. The outage should not exceed 6 hours as some ATM switch work is also being performed. - Ralph E. Wescott, Systems Analyst, EMSL MSCF.
Scientific Data Management (SDM) Software Outage on Tuesday, October 10, 4 PM PST - 6 PM PST
10/06/00 - Reason for Outage: Hardware upgrades. New disks will be added to address database capacity limitations. Impact: User data stored in the SDM area of NWArchive will be inaccessible and no new data may be stored during the outage. Users with FTP access should not be affected. Please contact Paula Cowley (paula.cowley@pnl.gov) or Dan Adams (dan@pnl.gov) if you need additional information. - Dan Adams
SYSTEM Outage on NWMpp1, NWTest, NWEcs1, Colony on Thursday, August 3, 1 PM PST - 3 PM PST
08/03/00 - We have had to delay the outage for a couple of hours because of other priorities. The outage will still take place, and up until 1:00 p.m. today, you will be able to submit jobs. There will be a system outage in order to upgrade the Postgres Database to fix bugs causing qbank problems. You will not be able to submit jobs, nor will jobs be able to start during this outage. Running jobs will continue to run. Thank you! - MSCF Support Staff
EMSL Network Outage - Tuesday, May 16, 6 PM PST to Wednesday, May 17, 6 AM PST
05/09/00 - The EMSL network will be shut down to upgrade software on network equipment. Network services will not be available during the outage. This is one of three pre-scheduled yearly outages. Your jobs on NWMpp1, NWEcs1 and Colony will continue to run, but you will not be able to log in to any of the systems during this period. If you have any questions, please contact mscf-support@emsl.pnl.gov. - Michael Valley, PNNL SP System Admin
GPFS Downtime - Thursday, April 13, 7 AM PST - 5 PM PST
04/05/00 - The GPFS outage will take place on Thursday, April 13th from 7:00 a.m. - 5:00 p.m. We will have to take GPFS (both /gpfs1 and /gpfs2) down. This downtime for GPFS will last a day. We will continue to leave the system up and jobs that DO NOT use GPFS will be able to run. Please set up your submission scripts NOT to use GPFS. Thanks! - Michael R. Valley, PNNL SP System Admin
NWEcs1 Outage - Friday, April 7, 10 AM PST - 12 PM PST
A temporary downtime has been set for NWEcs1 for 10:00 Friday morning. This is necessary in order to put in some debug code to troubleshoot the problem where the LoadLeveler API is causing the scheduler to die periodically. Also, a new Qbank version (2.8) will be tested on NWEcs1 at this time, which contains code necessary for allowing a single bank database to serve all MSCF machines. We expect the system to be available again by noon on Friday. If you have any questions or concerns, contact Scott Jackson (Scott.Jackson@pnl.gov).
NWMpp1 Outage - Wednesday, March 29, 7 AM PST - 5 PM PST
03/24/00 - This outage is to 1) install fixes for the communication timeout problem and the GPFS initialization problem, 2) reconfigure login nodes and GPFS2 server nodes to return six nodes to the compute pool. This will give NWMpp1 a total of 484 compute nodes. NWMpp1 currently has 478 compute nodes. The login nodes will now be: sw1409, sw1411, sw1509, sw1511. The GPFS2 server nodes will be sw1401, sw1402, sw1403, sw1501, sw1502, sw1503. The downtime that is set for Thursday will prevent any job from running whose wallclock exceeds the downtime. Jobs will be held in "Idle" status until the "setdowntime" is removed. Any job that is running at the time of the outage will be lost. If you have any questions or concerns please send an email to mscf-support@emsl.pnl.gov.
NWEcs1 Outage - Monday, March 20, 7 AM PST - Friday, March 24, 8 AM PST
03/09/00 - This outage will be to install the latest PTF sets for AIX V4.3.3 and PSSP 3.1.1. Do not install the LoadLeveler PTFs, since we will be testing the Mohunk version of LoadLeveler.
03/15/00 - The reservation that is set for Monday will prevent any job from running whose wallclock exceeds the downtime. Your job will be held in "Idle" status until the "reservation" is removed. Any job that is running at the time of the outage will be lost.
NWTest - Outage for reinstall - Tuesday, February 15 - Wednesday, March 8
02/16/00 - We are re-installing our test machine NWTest to the current level of NWMpp1. This will allow us to test IBM's migration code before we let that loose on NWMpp1. NWTest will be unavailable for the next 3 weeks, ETA 3/8/00. We'll send out a message when NWTest is again available for general use. - Ralph E. Wescott, Systems Analyst, EMSL MSCF
NWMpp1 - System Outage: Thursday, February 10, 7 a.m. - 10 a.m.
01/31/00 - Purpose: To apply efixes for GPFS and LoadLeveler. The downtime that is set for Feb 10 will prevent the running of any job whose wallclock exceeds the downtime. Your job will be held in "Idle" status until the "setdowntime" is removed. Any job that is running at the time of the outage will be lost. If you have any questions or concerns, please contact us through our support request form or by email at mscf-support@emsl.pnl.gov. Thank you! - MSCF Support Staff