[personal profile] mithriltabby
So a couple of RAID controllers go out on a couple of the ClearCase servers yesterday, early in the day. About a day later, we finally get the message that the servers are back online after having had the controllers replaced and the drives re-checked. I get in around 13:00 from a dental appointment, check my system, and find that the main tree in which my team is doing its work is behaving irrationally. I report the behavior to our support staff immediately, and an hour and a half later get a note that it’s being restored from backup and will be available in four hours, by which time the work day is over. That’s two days of work lost, when we could’ve restored a CVS repository from backup onto a spare machine in what, an hour? And been able to work from our checked-out files the whole time?

A couple of RAID controllers?

Date: 2003-05-20 10:50 pm (UTC)
From: (Anonymous)
Who do you buy your equipment from? If one fails it's an anomaly, but two is faulty equipment or power.

I read a white paper on CVS in large groups once. It talked about 14 people. If your company has 14 or fewer developers, move to CVS. My main concern about CVS is drift. How do you keep everyone on the same page? Central servers... can't do that though, because then we would wind up in the same position when the hardware fails.

Your Silver Bullet (tm) is looking tarnished and poorly aimed.

Your silver bullet is CVS

Date: 2003-05-20 11:08 pm (UTC)
From: (Anonymous)
From a distance, scanning your Journal, it appears that your current employer has no plan. You say if one server goes down ClearCase is useless? My understanding of ClearCase is that each VOB, Versioned Object Base, is a separate repository. When I am forced to do ClearCase work, each VOB is a project. Occasionally, when a company is large enough, tools are given their own repository. If you need more than 6 VOBs to work on a project, fire your Configuration Management people, now!!!
Most CM people are overpaid and wield enormous amounts of power. That combined with stupidity is dangerous.
On the other hand, if your CM people are ignored, your company is getting what it deserves. Actually, it's getting off easy.


~burnt

how much data?

Date: 2003-05-21 11:47 am (UTC)
From: [identity profile] chrisla.livejournal.com
Your friendly employer stores about 2 terabytes of data for ClearCase. What type of tape systems are you using that you can do a full restore of 2TB in an hour? Is this McRestore?
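To put a number on that question, here is a back-of-envelope sketch of the sustained throughput a one-hour, 2 TB restore would need. It assumes decimal units (1 TB = 10^12 bytes) and a perfectly steady transfer; the function name is made up for illustration.

```python
# Back-of-envelope: sustained throughput needed to restore a dataset
# within a given time window.
# Assumption: decimal units (1 TB = 10**12 bytes, 1 MB = 10**6 bytes).

def required_throughput_mb_s(data_tb: float, hours: float) -> float:
    """Return the sustained rate in MB/s needed to move data_tb terabytes in `hours`."""
    total_bytes = data_tb * 10**12
    seconds = hours * 3600
    return total_bytes / seconds / 10**6

rate = required_throughput_mb_s(2, 1)
print(f"{rate:.0f} MB/s")  # prints "556 MB/s"
```

Comparing that figure against the rated speed of whatever drives are actually installed gives a rough lower bound on restore time, before any seeking or rewinding is counted.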

How many tape drives and libraries has your employer been budgeted to buy? Is it possible more were requested, and were turned down for budget constraints? Not all problems can be solved with a $1k Linux box.

Quit having such a narrow view.

data

Date: 2003-05-21 01:09 pm (UTC)
From: [identity profile] chrisla.livejournal.com
Indeed, it is not all stored in one place; the outage spanned two different arrays, on separate power circuits, in separate cabinets. The hardware has some redundancy, but to deal with that level of failure, you need to start thinking about separate independent sets of hardware, with the data replicated to a second site.

The restore process is also slowed b/c we have to multiplex several backup sessions onto a given tape/drive when doing backups. When you go to do a restore, the backup software has to spin thru much more of the tape than is actually needed to get back the given dataset. The nature of the data being written in multiple streams to a single tape also means it has to spend more time stopping/seeking/rewinding as it reads.
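A toy model makes the cost of multiplexing concrete. This sketch assumes blocks from each session are interleaved strictly round-robin on the tape, which real backup software does not necessarily do; the session names and block counts are invented for illustration.

```python
# Toy model of multiplexed tape backups (illustrative assumptions only):
# blocks from several backup sessions are written round-robin to one tape,
# and a restore must read sequentially past every block up to the last one
# belonging to the session it wants.

def interleave(sessions, blocks_per_session):
    """Lay out (session, block_index) pairs round-robin on a single tape."""
    return [(s, i) for i in range(blocks_per_session) for s in sessions]

def restore(tape, wanted):
    """Return (blocks_kept, blocks_scanned) for restoring one session."""
    kept = sum(1 for session, _ in tape if session == wanted)
    last = max(pos for pos, (session, _) in enumerate(tape) if session == wanted)
    return kept, last + 1

tape = interleave(["A", "B", "C", "D"], 1000)
kept, scanned = restore(tape, "A")
print(kept, scanned)  # prints "1000 3997"
```

With four sessions sharing the tape, the drive passes over roughly four times as many blocks as the restore needs, and that is before counting any stop/seek/rewind overhead.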


I hope you are right on the budget front. Right now, said executives are willing to consider major improvements. However, at times their memories are short.

As for a re-org, we are not just sitting back and accepting the status quo. We have some sweeping plans in place involving new hardware that was purchased and some re-orgs of how the data is structured in ClearCase. This has not been widely publicized b/c it will involve some real work-flow habit changes from engineering as well. We want to have a clear, well-thought-out plan before these re-structures are presented to the engineers. I agree, as an engineer, it is totally unacceptable that you were unable to work for a day +, but do not think that no one is working to improve this.

I'd be glad to show you the new hardware and show how we hope to improve things with the new layout if you are interested. (2nd floor near the datacenter)
