Wednesday, October 27, 2010

Say No to Manual Fencing in GFS Production Environments

Why should manual fencing be avoided in a production cluster?

Manual fencing for Global File System (GFS) should be avoided in production clusters. It is meant only for testing, when an appropriate fencing device is not yet available. Red Hat recommends a network power switch or a Fibre Channel switch as the fencing device for production clusters, to guarantee filesystem integrity.

Outlined below is a scenario that explains how manual fencing might lead to filesystem corruption:

   1. A node stops sending heartbeats long enough to be dropped from the cluster, but it has not panicked and its hardware has not failed. There are a number of ways this could happen: a faulty network switch, gulm hanging while writing to syslog, a rogue application on the system locking out other applications, etc.

   2. One of the other cluster members initiates fencing of the node. fence_manual is called, and lock manager operations are put on hold until the fencing operation is complete. (NOTE: Existing locks are still valid, and I/O continues for activities that do not require additional lock requests.)

   3. The administrator sees the fence_manual message and immediately enters fence_ack_manual to get the cluster running again, without first checking the status of the failed node.

   4. The journals for the fenced node are replayed and its locks are cleared, so other operations can continue.

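   5. The fenced node, which is still running, continues to do read/write operations based on its last lock requests, even though the cluster has already released those locks to other nodes. The file system is now corrupt.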
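
The sequence above can be sketched as a toy simulation. This is only an illustration of the race, not GFS or lock manager code: the ToyCluster and ToyNode classes, the node names, and the "inode-1" resource are all hypothetical, and the model assumes a node trusts its own cached view of the locks it holds.

```python
# Toy model only: every name here is hypothetical, not a GFS or lock manager API.

class ToyCluster:
    def __init__(self):
        self.members = set()
        self.lock_holder = {}  # resource -> name of the node holding the lock
        self.blocks = {}       # resource -> list of (node, data) writes on shared storage

    def join(self, node):
        self.members.add(node)

    def acquire(self, node, resource):
        """Grant the lock if the node is a member and the lock is free."""
        if node not in self.members:
            return False
        if self.lock_holder.get(resource) not in (None, node):
            return False
        self.lock_holder[resource] = node
        return True

    def ack_manual_fence(self, node):
        """Steps 3-4: the operator acknowledges the fence, so the cluster
        drops the node and clears its locks. Nothing here stops the node."""
        self.members.discard(node)
        for resource, holder in list(self.lock_holder.items()):
            if holder == node:
                del self.lock_holder[resource]


class ToyNode:
    def __init__(self, name, cluster):
        self.name = name
        self.cluster = cluster
        self.cached_locks = set()  # locks this node believes it still holds
        self.powered = True
        cluster.join(name)

    def lock(self, resource):
        if self.cluster.acquire(self.name, resource):
            self.cached_locks.add(resource)
            return True
        return False

    def write(self, resource, data):
        """Write based on the node's own cached lock state (step 5)."""
        if not self.powered or resource not in self.cached_locks:
            return False
        self.cluster.blocks.setdefault(resource, []).append((self.name, data))
        return True


def simulate(power_off_before_ack):
    """Return the names of the nodes that wrote after the fence was acknowledged."""
    cluster = ToyCluster()
    a = ToyNode("node-a", cluster)
    b = ToyNode("node-b", cluster)

    a.lock("inode-1")
    a.write("inode-1", "a1")

    # node-a stops sending heartbeats but keeps running (step 1).
    if power_off_before_ack:
        a.powered = False  # what a power-switch fence device would guarantee
    cluster.ack_manual_fence("node-a")

    b.lock("inode-1")
    b.write("inode-1", "b1")
    a.write("inode-1", "a2")  # stale lock: succeeds only if node-a is still up

    return [node for node, _ in cluster.blocks["inode-1"][1:]]


if __name__ == "__main__":
    print("ack without checking the node:", simulate(power_off_before_ack=False))
    print("node powered off before ack:  ", simulate(power_off_before_ack=True))
```

In the first run both nodes write to the same resource after the fence is acknowledged, which is the corruption case. In the second run only the surviving node writes, because the failed node was really stopped before its locks were released.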
