Transcript Document

Avid System Monitor
Ed Harper
November 2010
1
Avid System Monitoring overview
• Avid System Monitor delivers an enterprise-wide monitoring solution for Avid systems and infrastructure switches
Overview
• Single GUI visibility to whole infrastructure
• Standards based polling and event notification; SNMP, IP, HTTP
• Tightly integrated with Avid Health Monitor
• Integrates with enterprise management
Devices Managed
• ISIS 5000 & 7000; System Director
• Interplay; Media Indexer, Look Up Server, Interplay Engine, Capture, ASF services
• SNMP Network Switches (Cisco, Foundry, Force10)
Capabilities
• Real Time Statistics, thresholds
• Events, Alarms, Notifications (email)
• Historical statistics
• Surveillance Dashboard
• Flexible reporting tools
Benefits
• Proactive real time status and statistics
- Identify anomalies, prevent outages
- System wide diagnostic tools, faster restoration
• Trend analysis
2
What it is
• A tool to increase system availability by identifying issues in real time
• A tool to help identify potential problems in a system as they are occurring
• A single tool for monitoring all necessary components of the “system”, including Avid gear, network infrastructure and 3rd party devices
• A tool that collects performance data over time so that it can be graphed (and trends identified)
• A tool that will continually evolve to identify known problems within a system (after knowledge of those problems has been learned during Code Blues, etc)
• A window into the specific state of the Avid & selected infrastructure system components at a given point in time. It also provides enough flexibility for customers to refine and fine tune the tool’s outputs once the basic functions are mastered.
3
Overview
• Avid System Monitor delivers enterprise solution monitoring for Avid systems and infrastructure
  – Pro-active system health and status monitoring
  – Statistics gathering, graphing and thresholds
  – Event logging, intelligent alarm processing and notification
  – Dashboard views showing outages and availability
    • Simple drill down to isolate issues
  – Standards based
    • SNMP, HTTP & IP port status
  – Avid Monitoring Gateway service installed on Framework (ASF) enabled devices to provide visibility to Avid System Monitor via HTTP
4
Monitoring components
Monitored Node
Agents
Monitoring Server
Recommended platform SR2500
GUI, SNMP & HTTP collection
SQL Database
Java (JDK) Environment
• Interplay Engine
• Stream Server
• Capture
• Media Indexer
• Interplay Lookup Service (LUS)
• ISIS 7000 System Director
• ISIS 5000
3rd Party
• Cisco switches
• Foundry Switches
• Force10
Real Time Audit
Agentless
• Avid Service Framework
• Provides time sync
• AirSpeed, AirSpeed Multi Stream
• Capture Manager
• DNS, DHCP services
• Time Sync
5
Monitoring Environment
• Monitored Avid Services & Devices
  – Detailed monitoring including status, statistics etc.
    • Avid Service Framework (ASF)
      – Media Indexer (MI)
      – ASF Lookup Service
    • Interplay Engine
    • Stream Server
    • Interplay Capture
    • ISIS 5000 & 7000: System Director
• Real-time inventory
  – Device up/down status without detailed monitoring
    • Workflow Engine, iNews FTS, Workstation Service, Time Sync service, Multicast repeater, LowRes Encoder
• 3rd Party Elements
  – Windows services; DNS, DHCP etc
  – Network Switches
    • Cisco, Foundry, Force10
6
Dashboard
• Single screen view with intelligent grouping of devices & domains
• High level status
  – Alarms
  – Notifications
  – Node Status
  – Resource Graphs
• Click on any device group to automatically filter information for selected devices
7
Events & Alarms
• Extensive Event Logging
  – Severity, source etc
  – Acknowledgement
  – Search
  – Fine grain event details
  – Correlating up/restore
• Alarms
  – Flexible rules to allow event aggregation in alarm view to count multiple occurrences of same event
    • Severity
    • Last time of event
    • Count occurrences
    • Link to event details
    • Option to auto-clean events
  – Operator Instructions specific to alarm & device type
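The aggregation rule on this slide can be pictured as folding events that share a key into one alarm row that carries the severity, the last time of the event, an occurrence count and links back to the event details. The sketch below is only an illustration of that idea, not the Avid System Monitor or OpenNMS implementation; the `Event` and `Alarm` types, the `aggregate` function and the choice of (node, event name) as the aggregation key are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Event:
    node: str
    name: str
    severity: str
    time: datetime


@dataclass
class Alarm:
    node: str
    name: str
    severity: str
    last_time: datetime
    count: int = 0
    events: list = field(default_factory=list)  # links back to event details


def aggregate(events):
    """Fold events with the same (node, name) into a single alarm."""
    alarms = {}
    for ev in sorted(events, key=lambda e: e.time):
        key = (ev.node, ev.name)  # hypothetical aggregation key
        alarm = alarms.get(key)
        if alarm is None:
            alarm = alarms[key] = Alarm(ev.node, ev.name, ev.severity, ev.time)
        alarm.count += 1                # count occurrences
        alarm.last_time = ev.time       # last time of event
        alarm.severity = ev.severity    # severity of the most recent occurrence
        alarm.events.append(ev)
    return alarms
```

With this shape, two "link down" events from the same node show up as one alarm with a count of 2 rather than as two separate rows.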
8
Notifications
• Flexible notification to email
  – Individuals or groups
• Automatic Escalation
  – Escalation to higher level group if notification is not acknowledged within certain time
    • Example: a Minor event is sent to the Ops team; if it is unacknowledged for 20 minutes, its priority is raised to Major and a notification is issued to the Management team
• Notification logging, with timestamps including response time
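The escalation example on this slide (Minor to the Ops team, then Major to the Management team after 20 unacknowledged minutes) can be sketched as a single decision function. This is a minimal sketch under stated assumptions: the `Notice` type, its field names and the `escalate` function are hypothetical, and the real product configures escalation through its own notification settings rather than code like this.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

ESCALATE_AFTER = timedelta(minutes=20)  # delay taken from the slide's example


@dataclass
class Notice:
    severity: str
    target: str
    sent_at: datetime
    acknowledged: bool = False


def escalate(notice, now):
    """Return the follow-up notice due at `now`, or None if none is due."""
    if notice.acknowledged:
        return None  # someone responded, nothing to escalate
    if now - notice.sent_at < ESCALATE_AFTER:
        return None  # still inside the acknowledgement window
    # Unacknowledged for the full window: raise priority, notify the next group.
    return Notice("Major", "Management team", now)
```

Logging `sent_at` and the acknowledgement time for each notice is also what makes the response-time figures mentioned above possible.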
9
Statistics & Charts
• Historical statistics gathering, trending, charts
• Thresholds set to trigger events and notifications on ‘interesting’ conditions
  – Specifically tuned to Avid components, based on real world experience
10
Threshold Event Notification
• Flexible Threshold engine
  – Configurable on any counter in the system
  – Extensive pre-programmed thresholds provided in Avid monitoring package
  – Simple process to add customer specific threshold
Figure: Media Indexer, Media Files chart with admin configurable trigger levels
11
Threshold Configuration
• Custom configuration of Threshold Event
  – Any counter value collected by OpenNMS
  – Type; High, Low, Relative Change, Absolute Change
  – Datasource; Entity to collect counter data (graph properties)
  – Datasource Type; Node or interface
  – Datasource Label; String displayed in event
  – Value; Threshold value
  – Re-arm; Reset / Cleared value
  – Trigger; Number of times the threshold must be broken to create an event
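The interaction of Value, Re-arm and Trigger is easier to follow with a small state machine. The sketch below models only a High threshold and assumes that "broken" means consecutive polled samples at or above Value, and that the threshold re-arms once a sample falls to or below the Re-arm value. The `HighThreshold` class and its return strings are hypothetical names for this illustration, not the OpenNMS API.

```python
class HighThreshold:
    """Illustrative High threshold with re-arm and trigger count."""

    def __init__(self, value, rearm, trigger):
        self.value = value      # threshold value
        self.rearm = rearm      # reset / cleared value
        self.trigger = trigger  # consecutive breaches needed for an event
        self.breaches = 0
        self.armed = True

    def evaluate(self, sample):
        """Return 'triggered', 'rearmed' or None for one polled sample."""
        if self.armed:
            if sample >= self.value:
                self.breaches += 1
                if self.breaches >= self.trigger:
                    self.armed = False
                    self.breaches = 0
                    return "triggered"
            else:
                self.breaches = 0  # a good sample breaks the run
        elif sample <= self.rearm:
            self.armed = True
            return "rearmed"
        return None
```

Keeping Re-arm below Value gives hysteresis: a counter hovering around the threshold produces one event instead of a stream of them.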
12
Node View
• Single screen dashboard per node
  – Current Status
  – Availability; system and individual services
  – Notifications, Recent Events, Recent Outages
13
Outages & Availability
• Current Outages
  – Node or Service down
• Calculated 30-Day Availability
  – Color Coded
• Grouped by Device / Service Type
  – Click to drill down
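A 30-day availability figure of this kind is typically the share of the window during which the node or service was not in an outage. The slide does not give the formula, so the sketch below is an assumption: it takes a list of non-overlapping (start, end) outage intervals, clips them to the last 30 days, and treats an outage with no end time as still open. The function name and argument layout are hypothetical.

```python
from datetime import datetime, timedelta


def availability_percent(outages, now, window=timedelta(days=30)):
    """Percent of `window` ending at `now` not covered by outages.

    `outages` is a list of (start, end) datetimes; end may be None
    for an outage that is still open. Intervals must not overlap.
    """
    window_start = now - window
    down = timedelta(0)
    for start, end in outages:
        end = now if end is None else end
        start = max(start, window_start)  # clip to the window
        end = min(end, now)
        if end > start:
            down += end - start
    return 100.0 * (1 - down / window)
```

For example, a single outage of 7 hours 12 minutes inside the window is 1% of 720 hours, which gives roughly 99% availability.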
14
Surveillance View: Flexible Grouping
• Current Outages by;
  – Device Type
  – Workgroup or location
• Grouping by
  – Service
  – Category
  – Simple customization
15
Node Discovery
• Configure OpenNMS to discover devices and services on specific IP address or range
  – Automated capability query of generic IP, SNMP and Avid specific services & device capabilities
  – Add device names to nodes for readability if desired
    • IP address and DNS names displayed by default
• Automated capabilities scan every 24 hours
16
Network Switch Monitoring
• SNMP monitoring and statistics gathering for Cisco, Foundry & Force10 infrastructure Zone 2 switches
SNMP
• Link Up
• Link Down
Network
• Spanning Tree Topology Change
• Bandwidth Utilization
Thermal
• Max temp exceeded
System
• Memory utilization
• Processor utilization
Cisco
• Configuration change
Foundry
• Startup config change
• Running config change
• Telnet login / logout
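The slide does not say how bandwidth utilization is derived. A common approach, assumed here, is to poll the standard IF-MIB octet counters (`ifInOctets` / `ifHCInOctets` or their Out equivalents) twice and compare the change against the interface speed (`ifSpeed`, in bits per second). The function name and parameters below are hypothetical; the modulo handles a single counter wrap between polls.

```python
def utilization_percent(octets_prev, octets_now, seconds, if_speed_bps,
                        counter_bits=64):
    """Link utilization over one polling interval, as a percentage.

    counter_bits is 32 for ifInOctets-style counters and 64 for the
    high-capacity (ifHC*) counters.
    """
    if seconds <= 0 or if_speed_bps <= 0:
        raise ValueError("interval and interface speed must be positive")
    # Modulo arithmetic copes with one counter wrap between the two polls.
    delta_octets = (octets_now - octets_prev) % (1 << counter_bits)
    return 100.0 * delta_octets * 8 / (seconds * if_speed_bps)
```

On fast links the 32-bit counters can wrap more than once within a polling interval, which is why the 64-bit counters are generally preferred for gigabit and faster ports.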
17
Maps
• OpenNMS provides mapping tool with device status
  – Multiple maps to allow views for LAN, editors etc
  – Link discovery finds node connectivity
    • Not all links shown correctly; ISIS switches are not manageable, so devices appear connected to the adjacent switch
18
Proving its Value (a real field example)
• Phased Roll-out
  – Monitoring SNMP switches (only)
• Customer Reported AirSpeed “Slow Down”
  – Avid CS / Systems Engineers queried OpenNMS remotely
  – Pulled switch bandwidth utilization
    • Switches operating correctly
    • Within a few minutes troubleshooting team moved on to investigate specific devices
  – Without OpenNMS, proving switch operation required access and a labor intensive process of monitoring scripts and driving traffic loads
    • Time consuming ~ 1 day to prove switches
• Faster resolution
• Greater customer satisfaction
19
Example
• Memory Utilization on Interplay Media Indexer
• Charts show steady consumption of server RAM during load test
• Performance impacted as memory maxed out
• Thresholds provide notification when x% exceeded
20
Server & System Requirements
Category                      Requirement
Avid System Monitor Server    Recommended; Intel SR2500 Server
Operating System              Windows 2003
Processor                     2 GHz or better
Memory                        2 GB
Java JDK                      Provided with Avid System Monitor
PostgreSQL Database           Provided with Avid System Monitor
Client Browser
Adobe SVG viewer              Required for Internet Explorer client browser to view map pages (Firefox etc have SVG viewer built in)
21
Pricing, Availability etc
• Delivery
– Value-add offered to customers with Avid Uptime support
• Software download
• Phased roll-out at selected customer Production networks
– Typically switch monitoring
• Pricing
– Avid System Monitor available to Avid Uptime support contract customers
– PSG installation
• PSG engagement required
22
Summary
• Real-Time monitoring of devices, services, networks & infrastructure
  – Avid Customer Success
  – Customer IT / Admin
• Statistics, thresholds, events and notifications
• Broad Enterprise system support
  – Increasing breadth and depth
• Pro-active warnings and notification of potential problems
• Improved time to resolution
23
Avid Monitoring Solution
Diagram labels:
Monitoring server: OpenNMS GUI, data collection, trap receiver, service / IP monitoring
Avid components on monitored nodes: ASF Monitoring Gateway, ASF Health Monitor, Interplay SNMP
Monitored elements: ISIS client, Editor; LAN switches; AirSpeed, AirSpeedMS; Interplay Engine, Stream Server, Archive; ISIS ISB, ISIS switch, ISIS Engine, System Director; Lookup Server; Media Indexer
Protocols shown: ICMP (Ping), SNMP, HTTP/TCP, Avid TCP Port monitoring, DNS, time sync
Legend: ICMP only; Full Monitoring (events, statistics)
25
Failure Modes Monitored
• Avid System Monitor is tuned to identify specific failure modes
  – As found in field experience / Code Blue
• Media Indexer
  • MI in the HAG with a weight of "0": Indicates an "election issue" which can cause major system slowdown.
  • Number of quarantined files growing: Indicates a faulty ingest device creating bad files.
  • Different file count between each of the HAG MIs: Indicates issue with ISIS notifications. Some files will appear offline to some clients.
  • Different time on each of the machines in the WG: Can be the cause of lost ISIS notifications (see above).
  • MI Heap usage running dangerously high: Indicates your WG file count or client count is causing too much stress on that MI. Eventually, the MI will thrash.
  • Number of files added/updated on last full resync is greater than 0: This value is displayed in the Health Monitor, under each storage pane of the MI.
• Interplay Engine
  • Time to perform login - should be below 15 seconds: indicates engine slowness
  • Number of journal files - should be below 50: indicates journal integration stuck/dead
  • Number of deletes - should be below 100 for 5 minute polling intervals during normal production time: indicates deletion during production time
  • Number of loaded objects/number of total objects - should be above 30%: indicates engine cache warm-up causing slowness
  • Backup running flag - should be off during production time
• Avid Service Framework Lookup Service (LUS)
  • For LUS, here are things we could check today via SNMP Gateway. However, these monitor points don't really contribute to most of the problems we see related to ASF. They are the only data points that are available today.
  • Monitor Handle Count (either via gateway or MSFT agent) - should be below some threshold (<5000)
  • Monitor Thread Count (either via gateway or MSFT agent) - should be below some threshold (<500)
  • Monitor Events In Queue (via gateway) - should be less than 50
  • Check that a process is bound to port 4160 on the box (don't know how to do that with OpenNMS) - confirms that the LUS process is running
  • Monitor Memory Usage (either via gateway or MSFT agent) - should be below some threshold (<200MB)
• ISIS
  • ISIS monitors a number of critical areas and sends an event to the Windows event log when values reach a defined value or threshold. You can configure ISIS to send an email when an error or warning event occurs. You can also configure the System Director to generate an SNMP trap when the event occurs. The top areas include the following:
    • Temperature and presence of components such as switches, storage elements, and power supplies.
    • Workspace usage thresholds. For example, an Admin can enable warning and error thresholds. If you set the workspace threshold to 90%, ISIS will generate an error event when a workspace reaches 90% full.
    • Disk health issues such as disk failed or disk performance degraded based on continuous monitoring.
    • Server failover notifications. For example, on a failover system you are notified when the system fails over to the other node.
    • Metadata problems. For example: if there is a problem opening a metadata file or if the metadata in a file seems out of date.
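The numeric limits quoted above for the Interplay Engine and the LUS lend themselves to a table-driven check. The sketch below is one hedged way to express them: the limits come from the slide text, but the metric names, the `LIMITS` table and the `violations` function are hypothetical and do not correspond to actual counter names in Avid System Monitor or OpenNMS.

```python
import operator

# metric name (hypothetical) -> (comparison that must hold, limit from the slide)
LIMITS = {
    "engine_login_seconds": (operator.lt, 15),
    "engine_journal_files": (operator.lt, 50),
    "engine_deletes_per_5min": (operator.lt, 100),
    "engine_loaded_object_percent": (operator.gt, 30),
    "lus_handle_count": (operator.lt, 5000),
    "lus_thread_count": (operator.lt, 500),
    "lus_events_in_queue": (operator.lt, 50),
    "lus_memory_mb": (operator.lt, 200),
}


def violations(sample):
    """Return the sorted names of metrics in `sample` that break their limit."""
    broken = []
    for name, value in sample.items():
        if name not in LIMITS:
            continue  # unknown metrics are ignored in this sketch
        holds, limit = LIMITS[name]
        if not holds(value, limit):
            broken.append(name)
    return sorted(broken)
```

Keeping the limits in data rather than code matches the slide's point that thresholds are refined as field experience accumulates: adding a newly learned failure mode is a one-line change.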
26
Monitored Device Matrix
Device / Service                  Version(s), Notes                              Inv / Mon
Unity ISIS                        v2.x                                           √
Interplay Engine                  v2.x                                           √
Media Indexer                     v2.x                                           √
Interplay Engine                  v2.x                                           √
Interplay Lookup Service (LUS)    v2.x                                           √
AirSpeed                                                                         √
AirSpeed Multi Stream                                                            √
Capture Manager                                                                  √
Interplay Capture                 v2.x                                           √
DNS & DHCP services                                                              √
Avid Time Sync Service                                                           √
3rd Party Network Switches        Cisco / Foundry / Force10                      √
Windows Services                  DNS, DHCP, Time, Anti-virus, auto-update etc   √
Inv: Real time Inventory; service or server Up/Down
Mon: Full monitoring; detailed alarms, statistics etc
27