6248UP Fabric Interconnect broken fan

Started by Dieselboy, January 05, 2016, 02:55:20 AM

Dieselboy

One of the fans, in one of the fan modules, within one of our 6248 FIs is making a noise like a 500cc single-cylinder motorbike at full throttle. I pulled out the module while the unit was running to take a look at it, and it has 2 fans inside. I pushed gently on the fans, and one of them can move back and forth quite a lot (a lot being about half a centimetre / 5mm, maybe even slightly more). The other fan is quite secure and doesn't move much at all, probably 1mm at most.

I'm not sure exactly when this happened; it was already like this on the first day back after Christmas.

The noise is so loud that it's now the loudest thing in the comms room and also the loudest thing in the hallway immediately outside the comms room. It's louder than all of the fans combined in the comms room  ;D

I am a bit worried about it causing damage or getting hot and friction causing a fire. So I removed the whole module last night but the TAC engineer stated that the FI cannot run with only the 1 remaining module as it (the FI) will overheat. So it's back in.

I've never seen this before. The FIs are 2.5 years old. There is no dust in the comms room. There's minimal dust on the fan blades as well. If I went to inspect and clean my own 2901 router's fan and it had this little dust on it, I'd probably not even bother.

What does keep happening, though, is that our comms room keeps getting warm. The A/C unit is quite beefy. When this cooling issue first started, I did the working out to see if the unit was being pushed. I realised that it's only cooling at around 50% of its capacity, and we're pulling 16A to 18A on the main line powering the comms room's UPS, depending on load from the VMs. We have around 90 VMs (and only 40 people in the company...  :-X )
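For anyone wanting to repeat that working-out, here is a rough Python sketch of the arithmetic. It is only an estimate under assumptions that are not in the post above: a 230 V single-phase supply, a power factor of about 1, and all of the electrical power drawn ending up as heat in the room. The function names are made up for illustration, and the AC unit's rated capacity has to be supplied by whoever runs it, since it isn't stated here.

```python
# Rough heat-load estimate for a comms room from the measured supply current.
# Assumptions (not from the original post): 230 V single-phase supply,
# power factor ~1.0, and all electrical power drawn ends up as heat.

WATTS_PER_BTU_PER_HOUR = 0.29307107  # 1 BTU/h expressed in watts


def heat_load_watts(current_amps, voltage=230.0, power_factor=1.0):
    """Electrical power drawn, treated as the heat the AC must remove."""
    return current_amps * voltage * power_factor


def watts_to_btu_per_hour(watts):
    """Convert watts to BTU/h, the unit many AC data sheets use."""
    return watts / WATTS_PER_BTU_PER_HOUR


def utilisation(load_watts, cooling_capacity_watts):
    """Fraction of the AC's rated cooling capacity being used."""
    return load_watts / cooling_capacity_watts


if __name__ == "__main__":
    for amps in (16, 18):
        watts = heat_load_watts(amps)
        btu = watts_to_btu_per_hour(watts)
        print(f"{amps} A -> {watts / 1000:.2f} kW, {btu:,.0f} BTU/h")
```

Under those assumptions, 16A to 18A works out to somewhere around 3.7 to 4.1 kW of heat. Passing that and the unit's rated cooling capacity to `utilisation()` gives the kind of percentage figure quoted above.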

The cooling issue is infrequent, but I managed to trace it. To explain: the AC unit is set to a fixed temperature (21C). The room cools down and the thermostat within the AC unit stops the cooling; the temp rises a bit, and the thermostat should kick the unit back on to cool the air going through it. For some reason the issue occurs at the point where the thermostat is supposed to switch the unit on. The thermostat will be in the "ON" position but the AC unit is in the "OFF" position (best way I can describe it, but basically the room needs to be cooled, the thermo has clicked on, but the AC unit either doesn't get the signal or ignores it). So the fix I have found is to get the remote for the AC, turn it up to a temperature hotter than the room (so the thermo clicks off) and then turn the temp back down to normal. The second time the thermo clicks on it normally kicks in the AC and all is well again for another 3 months.
The room will heat up to around 27C / 28C while the thermostat is playing up.
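The stuck state described above (room well over the setpoint, AC not cooling) could in principle be caught automatically rather than by someone walking in. Below is a minimal Python sketch of that idea. It assumes some temperature sensor is already being logged as (timestamp, temperature) pairs, which is not something mentioned in this thread, and the `find_stuck_cooling` name, the 3C margin and the 30 minute grace period are purely illustrative values.

```python
from datetime import datetime, timedelta


def find_stuck_cooling(readings, setpoint_c=21.0, margin_c=3.0,
                       grace=timedelta(minutes=30)):
    """Return the time at which the room had been above setpoint + margin
    continuously for at least `grace`, or None if that never happened.

    `readings` is an iterable of (timestamp, temperature_c) in time order.
    """
    threshold = setpoint_c + margin_c
    hot_since = None  # start of the current run of over-threshold readings
    for timestamp, temp_c in readings:
        if temp_c > threshold:
            if hot_since is None:
                hot_since = timestamp
            if timestamp - hot_since >= grace:
                return timestamp
        else:
            # Room cooled back down, so the AC did kick in; reset the run.
            hot_since = None
    return None


if __name__ == "__main__":
    start = datetime(2016, 1, 5, 2, 0)
    temps = [21.0, 25.0, 26.0, 27.0, 28.0]  # one reading every 10 minutes
    log = [(start + timedelta(minutes=10 * i), t) for i, t in enumerate(temps)]
    print(find_stuck_cooling(log))
```

A normal thermostat cycle drifts a little above the setpoint and then recovers, so the margin and grace period are there to avoid alerting on that. Whatever does the alerting (email, SMS, etc.) is left out on purpose.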

Could the heat have ruined the lube in the bearings on this fan and caused it to fail prematurely?

I've been saying to management they need to get it fixed because I don't like going to the office at the weekend or at silly o'clock just to switch the AC unit off and on. I've said a number of times that it will reduce the lifetime of the equipment.

Reggle

You're getting a new fan via TAC, right? I would really insist on it. As for the temperature: that's not good. No telling what caused it, maybe even rough handling during the original transport, who knows. But I've known some batches of Cisco Catalyst switches fail fast when the temperature isn't right.

deanwebb

Agree on the part... bad parts happen... I learned that back in '95, doing Windows tech support. Does every hard drive fail? No. But a certain percent of drives fail right out of the starting gate... no way to avoid that. Swap out the defective part and carry on.

The temperature is a HUGE issue, as it absolutely can aggravate marginal working parts and bring them to a failure state. This fan may have been a canary in the coalmine - other parts may be receiving enough stress to bring them to a fail state, one at a time, in a process of attrition. There will be no sudden failure, most likely. But you want the HVAC in that room fixed as soon as possible, so that vendors don't refuse to replace gear because you didn't operate it properly. You may get the first part replaced, no questions asked. But subsequent replacement requests could raise a flag and cause a few more questions to be asked before you get the new stuff.

You are absolutely spot on in your assessment of the situation. Should your remote system fail or some other part of your workaround fail, then that DC can experience a slow-moving catastrophe, resulting in lost data and wrecked gear. Now, if that cost is less than getting proper HVAC work done, well... you'll have to lump it. But if fixing the HVAC is cheaper than repurchasing that gear, maybe now they should call the service contractor.
Take a baseball bat and trash all the routers, shout out "IT'S A NETWORK PROBLEM NOW, SUCKERS!" and then peel out of the parking lot in your Ferrari.
"The world could perish if people only worked on things that were easy to handle." -- Vladimir Savchenko
Any questions? No questions! | BCEB: Belkin Certified Expert Baffler | "Plan B is Plan A with an element of panic." -- John Clarke
Accounting is architecture, remember that!
Air gaps are high-latency Internet connections.

Dieselboy

There's $250,000AUD worth of equipment in that room. I've been shouting loud about the AC for almost 2 years. What's worse is that we're a company surviving on an old product which went to market many years ago. Our new flashy product which has been in development for many many years is about to hit the market very soon, and it's all in our comms room. There is no DC, the comms room is in our office.
You may recall I've posted something about our AC a year or so ago.

I'll send another email to the CEO, who is rarely here, to explain what's happened and that the faulty AC could have contributed to it.

The new part arrived from TAC a few minutes ago and has been swapped in. All our stuff is under SMARTnet, which will be renewed this year.
What I noticed when comparing the new part to the defective part is that the bearings on the fans look different. I looked through the grille of the second FI we have, and it looks like both of its fan modules match the new part from TAC. Maybe the module which failed is known to fail and Cisco have redesigned it. It's just unlucky we have one FI which was built with the older fan modules. This is all an assumption of course.

I'll get some piccies of the 2 different units.

Regarding where you say: "But a certain percent of drives fail right out of the starting gate" - I agree with this. In my experience, and from previous discussions about the same thing, if a part or unit is going to fail, it usually does so very soon after you receive it new.

Dieselboy

Attached pics.
If you try to look through the grille of both FIs, you might be able to see that the bottom FI has both fan modules matching the new part, which is installed at FAN1 of the top FI. I had to reduce the size of the image, so it might be a bit difficult to see now.

icecream-guy

Quote from: Dieselboy on January 05, 2016, 08:40:36 PM
There's $250,000AUD worth of equipment in that room. I've been shouting loud about the AC for almost 2 years.

I've lived through that ordeal, where the computer room was 90+F on a cold winter's day, with both doors open and room fans running, till they started forking out money left and right to replace very large Solaris servers that were failing _very_ frequently and didn't know why...
:professorcat:

My Moral Fibers have been cut.

Dieselboy

90F is a tad on the warm side...

We just had an AC bloke come round to take a look at the unit. I instantly gave the reception girl a funny look when I heard the AC guy go into the comms room and ask if the unit was working. I mean, it was actively on and blowing ice cold air. It's approaching 40c outside (100F)
:zomgwtfbbq:

A few points he mentioned though. The instant you install one of these small AC units in a comms room you invalidate your warranty.
He then went on to talk about installing 2 units which I am all for if money were no object. That way there's a backup unit if one goes tits up again. We should spec it out so that we can run full time on one unit. Another benefit is that the load would be shared between them and I could probably run the thermostat at a slightly higher temp with the room being just as cool at the temp probe.
However the AC guy went on to explain that we would run one unit in the day and the other one at night so each unit had a break. Because just like people they need a rest too.
:zomgwtfbbq:

I think he's getting the MTBF of hardware a bit muddled up with that of organic matter.

deanwebb

Actually, HVAC does work best if it gets a periodic rest. I live in Texas, where I burned out a unit during one really bad heat wave.

Dieselboy

Hmmm. Was your unit not rated for continuous run time?
I'm gonna start a new thread so I can mention how I'm getting this cooling issue resolved. I do have some more Q's.