IBM BladeCenter Fundamentals

Installation and troubleshooting overview
5.3
Unit objectives
After completing this unit, you should be able to:
• Identify the BladeCenter components used to provide problem determination (PD) information
• List the planning elements required for the BladeCenter management network
• Select the functions available to modify firmware settings
• List the blade server indicators and Light Path Components
• Select the steps appropriate in diagnosing blade server hardware failures
• Identify the utility to use in displaying BladeCenter component health
2
Best practices
• Best practices
• Troubleshooting and problem determination
• BladeCenter management interfaces
• Firmware updates and settings
• Information gathering
• IBM BladeCenter support resources
3
BladeCenter chassis questions: Requirements
• Given your specific needs, what is the best BladeCenter solution (in
terms of components) necessary to meet your requirements?
• Define the networking and SAN requirements for your BladeCenter
environment based on your existing infrastructure, including fault
tolerance, throughput and interoperability.
• Do you plan on having a separate Management LAN and production
LAN? What is the advantage/disadvantage of this environment?
• Are all of the components being installed in the BladeCenter chassis on
the ServerProven list?
• Is this BladeCenter chassis to be deployed locally
or in a remote location?
4
Blade server considerations: Questions
• Is the blade server at the latest firmware level?
If not, which method will you use to apply the
latest firmware updates?
• Besides the BIOS, what other firmware
updates are needed for the blade server?
• What operating system are you going to put on
the blade server? How do you find out whether
this OS is supported on the blade server?
• What are the different deployment methods for
operating system installations, and which
method makes the most sense in your
environment?
• What performance do you need from the
blade server? Based on these
requirements, which model best fits your
business needs?
5
BladeCenter chassis questions: Power
• Do you understand the necessary power requirements for a
given BladeCenter solution?
• Will your BladeCenter chassis be connected to either a front-end or high-density front-end rack PDU?
• How many blade servers are in the chassis and will that
impact oversubscription of the power domains?
• Do you have the correct electrical connectors to power your
new BladeCenters and their PDUs?
6
Cooling questions
• Are the systems on a raised floor?
• How many BTUs am I generating when my installation is
complete?
• What are the power requirements for the new systems?
• Are there plans to grow in the future?
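The BTU question above can be estimated from the power question, because essentially all electrical power drawn by the systems ends up as heat. The following is a minimal sketch, assuming the standard conversion of about 3.412 BTU per hour per watt; the function name and the example load are illustrative and not taken from IBM documentation.

```python
# Standard conversion: 1 watt of continuous load is about 3.412 BTU per hour.
BTU_PER_HOUR_PER_WATT = 3.412


def heat_output_btu_per_hour(power_watts: float) -> float:
    """Return the approximate heat load in BTU/h for an electrical load in watts."""
    if power_watts < 0:
        raise ValueError("power_watts must be non-negative")
    return power_watts * BTU_PER_HOUR_PER_WATT


if __name__ == "__main__":
    # Hypothetical planned load; substitute the measured or planned figure
    # for the actual chassis configuration.
    example_load_watts = 2000.0
    print(f"{heat_output_btu_per_hour(example_load_watts):.0f} BTU/h")
```

Rerunning the estimate with the expected future load helps answer the growth question as well.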
7
Troubleshooting and problem determination
8
Problem determination: Information gathering
• Due to the variety of hardware and software combinations that can be
encountered, use the following information to assist you in problem
determination. If possible, have this information available when
requesting assistance from Service Support and Engineering functions.
– Machine type and model
– Microprocessor or hard disk upgrades
– Failure symptom
• Do diagnostics fail?
• What, when, where, single, or multiple systems?
• Is the failure repeatable?
• Has this configuration ever worked?
• If it has been working, what changes were made prior to it failing?
• Is this the original reported failure?
– Diagnostics version — type and version level
– Hardware configuration
• Print (print screen) configuration currently in use
• BIOS level
– Operating system software — type and version level
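One way to make sure this information is on hand before calling for assistance is to record it in a small structure and check it for gaps. The sketch below is only an illustration of the checklist above; the class name, field names, and sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProblemReport:
    """Checklist of the items to have available when requesting assistance."""
    machine_type_model: str = ""
    upgrades: str = ""                # microprocessor or hard disk upgrades
    failure_symptom: str = ""
    diagnostics_version: str = ""     # type and version level
    hardware_configuration: str = ""  # for example, a print-screen reference
    bios_level: str = ""
    operating_system: str = ""        # type and version level

    def missing_items(self) -> List[str]:
        """Return the names of the fields that are still empty."""
        return [name for name, value in vars(self).items() if not value.strip()]


if __name__ == "__main__":
    # Placeholder values only; fill in the real details for the failing system.
    report = ProblemReport(machine_type_model="example-type-model",
                           failure_symptom="example symptom")
    print("Still needed:", ", ".join(report.missing_items()))
```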
9
Blade servers: Diagnostics tools
• Light Path Diagnostics
• Standalone diagnostics
• Diagnostics by PC Doctor
– Test results are stored in a test log
– Management Module event logs contain system status messages from the
blade server service processor and can be:
• Viewed
• Saved to diskette
• Printed
• Attached to e-mail alerts
– Standard log is a summary of tests
– Press <Tab> while viewing the test log
• Power On Self Test (POST) beep codes
• Unified Extensible Firmware Interface (UEFI)
– Elimination of Beep Codes
– Advanced logging and firmware control
• Command-line interface (CLI)
10
IBM Blade Server: Front panel LEDs HS22 example
[Figure: IBM HS22 blade server front panel indicators and controls]
11
IBM Blade Server: System board diagnostic indicators HS22 example
• IBM HS22 Blade server system board example
– Memory, processor, and disk indicators
– Light Path Panel
[Figures: IBM HS22 blade server system board indicators; HS22 system board Light Path panel]
12
IBM Blade Server: Front panel LEDs LS22 example
[Figure: IBM LS22 blade server front panel controls and indicators]
13
IBM Blade Server: System board diagnostic indicators LS22 example
[Figures: IBM LS22 blade server system board; LS22 system board Light Path panel]
14
IBM Blade Server: Diagnostics tools
• Light Path Diagnostics
• Press F2 at POST to invoke standalone diagnostics
• Diagnostics by PC Doctor
– Test results are stored in a test log
– Management Module event logs contain system status messages
from the blade server service processor and can be:
• Viewed
• Saved to diskette
• Printed
• Attached to e-mail alerts
– Standard log is a summary of tests
– Press <Tab> while viewing the test log
• Power On Self Test (POST) beep codes
• Real time diagnostics
• Command-line interface (CLI)
15
Blade server: Basic input/output system (BIOS)
• Blade server BIOS
– Menu-driven setup
– Settings for configuration and performance
– Set, change, delete (IRQ, date and time, and passwords)
– Advanced settings for specific needs (for example, memory, CPU, PCI
bus, and BMC)
– BIOS defaults
• Flash diskette
– BIOS updates for host and devices
• CD-ROM
– BIOS/firmware updates and configuration for host and devices
• BIOS system board jumpers or switches
– BIOS boot selection
– Password override
– Wake on LAN enablement
16
UEFI: Unified Extensible Firmware Interface (1 of 3)
• The next generation of BIOS
• Allows OSs to take full advantage of the hardware
– Architecture independent
– Modular
• 64-bit code architecture
• 16 TB of memory can be addressed
• More functionality
– Adapter vendors can add more features in their options (for example, IPv6)
– Design allows faster updates as new features are introduced
– More adapters can be installed and used simultaneously
– Fully backwards compatible with legacy BIOS
• Better user interface
– Replaces Ctrl key sequences with a more intuitive human interface
– Moves adapter and iSCSI configuration into F1 setup
– Creates human readable event logs
• Easier management
– Eliminates “beep” codes; all errors can now be covered by Light Path
– Reduces the number of error messages and eliminates out-dated errors
– Can be managed both in-band and out of band
17
UEFI: Unified Extensible Firmware Interface (2 of 3)
[Figure: today's update and configuration on systems compared with tomorrow's; both use xFlash and ASU. Component labels in the figure: RSA II, Diags, BIOS, BMC, Pb DSA, IMM, UEFI]
18
UEFI: Unified Extensible Firmware Interface (3 of 3)
UEFI versus BIOS
UEFI | BIOS
64-bit code architecture: 16 TB of memory can be addressed. | 16-bit code architecture: only 1 MB of memory can be addressed.
Eliminates code space constraints. Adapter option ROMs can be loaded anywhere in memory with no size restrictions. | Adapter vendors must fit all option code into a shared 128K. Limits the number of adapters that can be effectively installed.
Adapter vendors are free to add function, for example, IPv6. | Vendors are limited in the function they can provide in the option ROM.
UEFI defines a human interface that is being extended to adapter vendors. | Cryptic Ctrl key sequences required for configuring adapters.
iSCSI configuration is in F1 Setup and consolidated into ASU. | iSCSI configuration requires a separate tool.
Elimination of beep codes: all errors covered by Light Path. Reduction in number of error messages. | Multiple beep codes for fundamental failures.
Adapter configuration can move into F1 Setup. Eliminates Ctrl key sequences for configuring adapters. | Advanced Settings Utility (ASU) has partial coverage of F1 settings.
In-band and out-of-band UEFI updates. Settings accessed out of band via ASU and the IMM. | In-band only updates via DOS, wFlash, or lFlash.
UEFI event codes available out of band. Human-readable event logs in F1 Setup. | Numerous legacy POST errors.
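The memory limits quoted in the UEFI and BIOS comparison can be related to address widths with a little arithmetic. This sketch assumes binary units (1 MB = 2^20 bytes, 1 TB = 2^40 bytes), which is an assumption about the slide's notation; under that reading, 1 MB corresponds to 20 address bits and 16 TB to 44 address bits.

```python
MIB = 2 ** 20  # bytes in 1 MB, assuming binary units
TIB = 2 ** 40  # bytes in 1 TB, assuming binary units


def addressable_bytes(address_bits: int) -> int:
    """Return how many bytes can be addressed with the given number of address bits."""
    if address_bits < 0:
        raise ValueError("address_bits must be non-negative")
    return 2 ** address_bits


if __name__ == "__main__":
    legacy_limit = addressable_bytes(20)  # the 1 MB limit quoted for legacy BIOS
    uefi_limit = addressable_bytes(44)    # the 16 TB limit quoted for UEFI
    print(legacy_limit // MIB, "MB versus", uefi_limit // TIB, "TB")
    print("Ratio:", uefi_limit // legacy_limit)
```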
19
Blade server: Integrated Management Module (IMM)
• Integrated Management Module (IMM)
– Replacement for BMC
– LAN over USB
– OS drivers included in Windows and Linux
20
Blade server six system states
System state | Data gathering | Data analysis
1. There is no AC | Visual | PDSG
2. There is AC power but no DC | Advanced Management Module (AMM) and IMM; Light Path | System event log
3. There is AC and DC power but the system fails to complete POST | Checkpoint codes; F1 and F2; beep codes (prior to UEFI); adapter BIOS messages | PDSG; Retain tips; IBM Support Web site
4. There is AC and DC power, the system completes POST but the NOS fails to start loading | F2 diagnostics | PDSG; Retain tips
5. There is AC and DC power, the system completes POST but the NOS fails to complete | NOS boot messages; 'Blue Screen' | NOS vendor messages
21
Advanced Management Modules (AMM): Overview
• The Management Module stores all event and error information for the BladeCenter
• The Management Module configuration data is stored both in itself and on the midplane
– To reset the IP address back to the default settings, press and hold the IP reset button for 3
seconds or less
[Figure: Advanced Management Module LEDs and connectors: power-on, activity, and error LEDs; serial console connector (RJ45); release handle; video connector; 10/100 Ethernet connector (RJ45) with port link LED and port activity LED; USB dual stack; pin-hole reset; MAC address]
22
Recovering Management Module TCP/IP address
• MM configuration data is stored in the midplane
– To reset a TCP/IP address only:
• Remove the cable from the MM Ethernet port
• Press and hold the IP reset button for 3 seconds or less
– TCP/IP address will reset to 192.168.70.125/255.255.255.0
– Simply replacing the MM will cause the replacement MM to adopt the
same values as the original MM
• PERFORM ALL RESET STEPS BEFORE REPLACING THE MM
23
Management Module full reset: Factory defaults
• MM configuration data is stored in the midplane
– To force a complete MM reset (including password):
• Remove the cable from the MM ethernet port
• Press and hold the IP reset button for 5 seconds
• Release the IP reset button for 5 seconds
• Press and hold the IP reset button for 10 seconds
– TCP/IP address will be reset to 192.168.70.125/255.255.255.0
– All IDs and passwords will be deleted (except USERID/PASSW0RD)
– Simply replacing the MM will cause the replacement MM to adopt the
same values as the original MM
• PERFORM ALL RESET STEPS BEFORE REPLACING THE MM
24
Advanced management event log
25
Problem determination: Blade server example
• Example of a memory DIMM problem
– Display of BladeCenter Front Panel LEDs
Management Module web interface indicating error LEDs
26
Problem determination: Blade server example
• Example of a memory DIMM problem
– Display of the Blade server front panel LEDs
Advanced Management Module Blade server LEDs
27
Problem determination: Blade server example
• Example of a memory DIMM problem
– Display of the BladeCenter Event Log
Advanced Management Module Event Log
28
Problem determination: Blade server example
• Using the IBM Problem Determination guide - IBM
BladeCenter HS21
– Locate the error symptom code in the log (in this example: 289)
– Match the table entry to the code
Check POST error log
for error message 289:
29
Problem determination: Blade server example
• Consult the IBM Installation Guide for the HS21
– Proper DIMM installation procedure
HS21 DIMM Installation slot and order
30
Problem determination: Blade server example
• Verifying fix and proper operation
AMM Status Display and Event Log
31
Problem determination: Blade servers
• What do you do if:
– Blade server powered down for no apparent reason
– Blade server does not power on, the system-error LED on the
BladeCenter system-LED panel is lit, the blade error LED on the blade
server LED panel is lit, and the system-error log contains the following
message: "CPUs Mismatched"
– Some components do not report environmental status (temperature,
voltage)
– Switching KVM control between blade servers gives USB device error
32
Ethernet switch modules: Addressing issues
• What do you do if:
– You have duplicate IP address reported on the ESM
– You have duplicate IP address reported on the blade server
– You have a native VLAN mismatch reported on the ESM
– There are connection problems to the blade servers
– The DHCP server uses up all IP addresses and the blade server
still cannot get an address
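For the duplicate IP address cases, a useful first step is to compare the addresses recorded for each device against one another. The sketch below assumes you already have a list of (device, address) pairs from your own inventory; the function name, device names, and addresses are hypothetical.

```python
import ipaddress
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_duplicate_addresses(
    assignments: Iterable[Tuple[str, str]]
) -> Dict[str, List[str]]:
    """Map each IP address used by more than one device to the devices using it."""
    owners: Dict[str, List[str]] = defaultdict(list)
    for device, address in assignments:
        # Parsing validates the address and gives a canonical string form.
        normalized = str(ipaddress.ip_address(address))
        owners[normalized].append(device)
    return {addr: devs for addr, devs in owners.items() if len(devs) > 1}


if __name__ == "__main__":
    # Hypothetical inventory for illustration only.
    inventory = [
        ("esm-bay1", "192.168.70.127"),
        ("blade-03", "192.168.70.127"),
        ("blade-04", "192.168.70.130"),
    ]
    for addr, devices in find_duplicate_addresses(inventory).items():
        print(addr, "is assigned to", ", ".join(devices))
```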
33
Problem determination: Ethernet switch I/O modules
• Hardware failures
– Not very common
– On MM, look under I/O Module Tasks ->
Power/Restart to see diagnostic code after reboot.
Also look at fault LED on the Ethernet Switch
Module
• Software Failures
– Not very common
– As with all products, software bugs do exist
– Reference the latest code readme file for a list of
resolved bugs with each release of code
• Misconfiguration of Ethernet Switch Module or
other component
– This is the most common issue encountered
– Often requires close cooperation between different
administrative groups to resolve
34
Ethernet switch modules: Configuration issues
• Most common issue encountered
– May be with the Ethernet Switch Module, a device upstream or the
server within the BladeCenter
– May also be misconfiguration on the Management Module
• Same tools used to troubleshoot configuration issues can also
be used to help isolate broken hardware and software bugs
• Usually requires close cooperation between network
administrators and server administrators
• Often helps to have special tools (for example, network sniffer)
to understand and resolve problem
35
Ethernet switch modules: Basic rules
• Do not attach cables to the ESM until both sides of the connection are
configured
• Do not put the blade servers on the VLAN that the ESM uses for its
management VLAN interface
• Make sure the ESM firmware (IOS) code is upgraded
• Decide the ESM management path (via Management Module or ESM uplinks)
and configure for it
36
BladeCenter management interfaces
37
BladeCenter AMM: System status screen
[Figure: AMM system status screen, showing the navigation menu and the main information window]
38
System Event Log (SEL) screen
• This screen shows event history of the BladeCenter
39
Hardware Vital Product Data (VPD)
• This screen shows information relating to the hardware in the
BladeCenter
40
Rules for I/O module management
• In-band management
– Use the AMM path to an I/O module
• Provides centralized management of all I/O modules
– All activities and reporting are through a single Ethernet port
– Makes LAN configuration easier
• Requires MM and all I/O modules to be on the same IP subnet
• Out-of-band management
– Requires enablement of external management over all ports
• May require management VLAN configuration
• Access will involve many Ethernet ports
• I/O module need not be on the same IP subnet as the MM
– If subnets are different, AMM path to I/O module is unavailable
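The same-subnet requirement for the AMM path can be checked before any cabling or configuration change. The following is a minimal sketch using Python's standard `ipaddress` module; the function name is illustrative, the management module address is the documented default, and the I/O module addresses are hypothetical.

```python
import ipaddress


def on_same_subnet(mm_address: str, io_module_address: str, netmask: str) -> bool:
    """Return True if both addresses fall in the same IP subnet for the given mask."""
    mm_net = ipaddress.ip_interface(f"{mm_address}/{netmask}").network
    io_net = ipaddress.ip_interface(f"{io_module_address}/{netmask}").network
    return mm_net == io_net


if __name__ == "__main__":
    mm = "192.168.70.125"   # default management module address
    mask = "255.255.255.0"  # default subnet mask
    # Hypothetical I/O module addresses for illustration.
    for io_module in ("192.168.70.127", "192.168.71.127"):
        status = "available" if on_same_subnet(mm, io_module, mask) else "unavailable"
        print(f"{io_module}: AMM path {status}")
```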
41
I/O module tasks: Close up
42
I/O module tasks: Advanced switch management
43
Ethernet switch I/O module Web interface
44
CIGESM Web interface
45
Nortel ESM Web interface
46
Fibre Channel switch module Web interface
• SAN Utility (QLogic)
– Full Function GUI
• SAN Browser (QLogic)
– Limited functionality
• Switch Explorer (Brocade)
– Limited functionality
47
Firmware updates and settings
48
UpdateXpress CD-ROM package
• UpdateXpress
– Bootable CD-ROM
– Supports maintenance of system firmware and Windows device drivers
• Automatically detects current device-driver and firmware levels
• Gives the option of selecting specific upgrades or allowing UpdateXpress to
update all of the system levels it detected as needing upgrades
• Can be installed using local DVD or over network using the AMM
49
UpdateXpress firmware update scripts
• UpdateXpress Firmware Update Scripts for BladeCenter
(UXBC)
– Process that enables firmware updates to be run in a remote,
unattended fashion
• Requires a management station and supporting software
– Windows or Linux OS
– FTP and TFTP servers somewhere on the management LAN
– UXBC discovery and deployment components
– For more information, see
http://www-03.ibm.com/systems/management/uxs.html
50
IBM preboot dynamic system analysis
• Provides problem isolation,
configuration analysis, error
log collection
– Collects information about:
• System configuration
• Network interfaces and settings
• Installed hardware
• Light path diagnostics status
• Service processor status and
configuration
• Vital product data, firmware,
and UEFI configuration
• Hard disk drive health
51
Advanced settings utility
• Enables the user to modify firmware settings from the
command line
– Supported on multiple operating system platforms
– Enables remote changes to POST and BIOS settings
• Does not require F1 access to a console session
– Supports scripting through a batch processing mode
– Does not update any of the firmware code
– For more information, see
http://www304.ibm.com/systems/support/supportsite.wss/docdisplay?brandind=5000008&lndocid=MIGR-55021
52
Information gathering
53
Data gathering
• Read the BladeCenter data collection guide
– Contains details of what logs and information are needed for
escalations
– Contains a step-by-step guide on how the logs are collected
– For more information, see
http://www304.ibm.com/systems/support/supportsite.wss/docdisplay?lndocid=SERV-BLADE&brandind=5000008
54
Gathering information from blade servers
• Blade server logs can be gathered within the operating system
– Use the following table to determine what utility to use
Type of blade server | Operating system | Type of gathering utility
HS Series | Windows | Dynamic System Analysis
HS Series | Linux | Dynamic System Analysis
LS Series | Windows | Dynamic System Analysis
LS Series | Linux | Dynamic System Analysis
JS Series | Linux | SNAP
SNAP is built into AIX and SNAP for Linux on Power can be found at:
http://techsupport.services.ibm.com/server/lopdiags.
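The selection table above can be expressed as a simple lookup, which is handy when scripting log collection across many blade servers. This is a sketch only: the dictionary mirrors the table rows, while the function name and the input normalization are assumptions made for illustration.

```python
# Rows of the selection table: (blade series, operating system) -> utility.
GATHERING_UTILITY = {
    ("HS", "Windows"): "Dynamic System Analysis",
    ("HS", "Linux"): "Dynamic System Analysis",
    ("LS", "Windows"): "Dynamic System Analysis",
    ("LS", "Linux"): "Dynamic System Analysis",
    ("JS", "Linux"): "SNAP",
}


def gathering_utility(series: str, operating_system: str) -> str:
    """Return the log-gathering utility for a blade series and operating system."""
    key = (series.strip().upper(), operating_system.strip().capitalize())
    try:
        return GATHERING_UTILITY[key]
    except KeyError:
        raise ValueError(f"No table entry for {series} on {operating_system}") from None


if __name__ == "__main__":
    print(gathering_utility("HS", "Linux"))
    print(gathering_utility("js", "linux"))
```

Combinations that are not in the table raise an error rather than guessing a utility.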
55
Gathering information from I/O switch modules
• Logs from a Brocade, Cisco, BNT or QLogic switch module
can be captured within the switch interface
– Enable capture text/console logging within the telnet application
– Login to the switch using telnet
– Issue the command from the table below
Type of switch | Command
Brocade | supportShow
Cisco | show tech-support
Nortel | maint/tsdmp
QLogic | show support
56
IBM BladeCenter support resources
57
IBM support Web site
• New central Web site for all server products:
http://www-304.ibm.com/systems/support/
– Select BladeCenter from the drop-down menu
58
Documentation
• Hardware Maintenance Manual
– Available electronically (Adobe Acrobat .PDF format) from the IBM
support Web site
• Primary support document for diagnostics and troubleshooting
• User’s Guide, Installation Guide
– System documentation that ships with the BladeCenter and with
options such as blade servers and switch modules
• Useful for confirming shipping group contents (missing parts, and so on)
and initial customer setup
59
IBM Blade Server references
• IBM BladeCenter Products and Technology
– http://www.redbooks.ibm.com/cgi-bin/searchsite.cgi?query=bladecenter
• IBM ServerProven – Compatibility for BladeCenter Products
– http://www-03.ibm.com/servers/eserver/serverproven/compat/us/
• System x Reference (xREF)
– http://www.redbooks.ibm.com/xref/usxref.pdf
• Intel Products
– http://www.intel.com/products/server/processors/index.htm
• AMD Products
– http://www.amd.com/us/products/server/Pages/server.aspx
60
Key words
• Advanced Management Module (AMM)
• Alternating Current (AC)
• Basic Input/Output System (BIOS)
• British thermal unit (BTU)
• Central Processing Unit (CPU)
• Cisco Intelligent Gigabit Ethernet Switch Module (CIGESM)
• Command-line interface (CLI)
• Compact Disc Read-Only Memory (CD-ROM)
• Dynamic Host Configuration Protocol (DHCP)
• Ethernet switch modules (ESM)
• Fibre Channel Switch Module (FCSM)
• File Transfer Protocol (FTP)
• Graphical User Interface (GUI)
• IBM BladeCenter E (Enterprise)
• IBM BladeCenter H (High Performance)
• IBM BladeCenter HT (High Performance Telco)
• IBM BladeCenter S (Simplification)
• IBM BladeCenter T (Telco)
• Integrated Management Module (IMM)
• Input-output (I/O)
• Internet Protocol (IP)
• Interrupt Request (IRQ)
• Jumper (J)
• Keyboard, Video, and Mouse (KVM)
• Local-Area Network (LAN)
• Management Module (MM)
• Non-Maskable Interrupt (NMI)
• Operating System (OS)
• Peripheral Component Interconnect (PCI)
• Power Distribution Unit (PDU)
• Power On Self Test (POST)
• Remote Supervisor Adapter II (RSA II)
• Secure Sockets Layer (SSL)
• Serial over LAN (SoL)
• Service Pack (SP)
• Service Support Representative (SSR)
• Simple Mail Transfer Protocol (SMTP)
• Simple Network Management Protocol (SNMP)
• Storage Area Network (SAN)
• System Event Log (SEL)
• Transmission Control Protocol (TCP)
• Trivial File Transfer Protocol (TFTP)
• Unified Extensible Firmware Interface (UEFI)
• UpdateXpress Firmware Update Scripts for BladeCenter (UXBC)
• Virtual Local Area Network (VLAN)
• Vital Product Data (VPD)
• Volt (V)
• Watt (W)
61
Checkpoint (1 of 2)
1. The _______________________ stores all major event and error
information for the BladeCenter and is the starting point for PD.
a. Ethernet Switch Module (ESM)
b. AMM
c. BIOS
d. Blade Server operating system log
2. True/False: In planning the BladeCenter management network,
bandwidth is the primary consideration.
3. The __________ enables the user to modify firmware settings from
the command line.
4. True/False: While AMM management can be done through a Web
interface, all switch modules must be configured using command
line.
62
Checkpoint solutions (1 of 2)
1. The _______________________ stores all major event and error information for
the BladeCenter and is the starting point for PD.
a. Ethernet Switch Module (ESM)
b. AMM
c. BIOS
d. Blade Server operating system log
Answer: b
2. True/False: In planning the BladeCenter management network, bandwidth is the
primary consideration.
Answer: False
3. The __________ enables the user to modify firmware settings from the command
line.
Answer: Advanced Settings Utility (ASU)
4. True/False: While AMM management can be done through a Web interface, all
switch modules must be configured using command line.
Answer: False
63
Checkpoint (2 of 2)
5. Select the correct statement regarding Blade Server status
indicators.
a. Memory and processor LEDs are on the Blade Server front panel
b. All Blade Server status LEDs are on the Light Path diagnostics panel
c. Blade Server status and error LEDs are on the Front Panel, Control Panel
and adjacent to components on the system board
d. Light Path status and error indicators require the Blade to be powered on
6. True/False: The UEFI is a functional replacement for legacy BIOS
7. True/False: To diagnose a Blade Server hardware problem, the first
step to take would be to remove the Blade from the chassis and
check the system board LEDs.
8. True/False: As a rule, power consumption is directly related to
resultant heat output.
9. Which function should be used to view Service Processor
configuration and hard disk drive health?
a. AMM Event Log
b. PreBoot DSA
c. AMM Monitor status page
64
Checkpoint solutions (2 of 2)
5. Select the correct statement regarding Blade Server status indicators.
a. Memory and processor LEDs are on the Blade Server front panel
b. All Blade Server status LEDs are on the Light Path diagnostics panel
c. Blade Server status and error LEDs are on the Front Panel, Control Panel and
adjacent to components on the system board
d. Light Path status and error indicators require the Blade to be powered on
Answer: c
6. True/False: The UEFI is a functional replacement for legacy BIOS
Answer: True
7. True/False: To diagnose a Blade Server hardware problem, the first step to take
would be to remove the Blade from the chassis and check the system board
LEDs.
Answer: False
8. True/False: As a rule, power consumption is directly related to resultant heat
output.
Answer: True
9. Which function should be used to view Service Processor configuration and hard
disk drive health?
a. AMM Event Log
b. PreBoot DSA
c. AMM Monitor status page
Answer: b
65
Unit summary
Having completed this unit, you should be able to:
• Identify the BladeCenter components used to provide PD information
• List the planning elements required for the BladeCenter management network
• Select the functions available to modify firmware settings
• List the blade server indicators and Light Path Components
• Select the steps appropriate in diagnosing blade server hardware failures
• Identify the utility to use in displaying BladeCenter component health
66