
Strategies and Tactics: Application Troubleshooting Simplified
How to speed time-to-root cause with network traffic recording and application-centric analysis

Fluke Networks Is…
Global
• A $300+ million company
• Profitable since its inception as a separate operating entity
• Over 800 employees worldwide service customers in more than 120 countries
• Approximately 30% of revenue from outside the U.S.
• Worldwide Headquarters: Everett, WA
– Major Facilities: Colorado Springs, CO; Duluth, GA; Bridgewater, NJ; Rockville, MD
– Sales Offices & Associates: Worldwide
– Technical Assistance Centers: Everett, WA; Eindhoven, NL

Fluke Networks Is…
Thriving
• Backed by a $12B corporate parent, Danaher Corporation
– Fluke Networks and Fluke are both part of Danaher (NYSE: DHR)

Trusted
• Trusted by 98 of the Fortune 100 who use Fluke Networks solutions to deploy, solve, manage and optimize their networks.

Agenda
• Today’s Challenges
– Complexity – Applications and the infrastructure that delivers them
– Change – You think you know your network? Wanna bet?
– Triage – Determining just who owns this problem
– Root cause analysis (RCA) – What is the specific cause of latency?
• Best Practices
– Getting in the Path of the Packets
– Capturing all the Packets
– Discovering Problems before the Customer Discovers Them
– Resolving Problems in a Timely Manner

Challenges

Complexity? You got it!

…and this is just the view of the app inside of a data center. What is happening in that user’s network?

Change – You think you know your network?
• The network is in a constant state of change
• Making assumptions about
– the path packets take
– utilization levels on ports
– traffic distributions
• Can lead to increased problem resolution time
• Without a clear view of the current state of the network, it is very difficult to quickly resolve network and application related problems

Triage – Determining just who owns the problem
• “It’s the network!”
• Whether it is a network problem, server problem, or application problem, the network always gets blamed first
• The faster we can determine the fault domain of the problem, the faster we can get the right resources working on it

Root Cause Analysis (RCA)
• Without a history of normal network operation, it is difficult to determine what is not normal
• Keeping a history of:
– Utilization levels
– Roundtrip latencies
– Protocol distributions
– Packet captures of working applications
Allows us to get to the root of the problem, without chasing symptoms that are not really part of the problem
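
The value of a baseline can be illustrated with a short sketch. It is not taken from any Fluke Networks product: the function name `is_abnormal`, the sample latencies, and the three-standard-deviation rule are all assumptions chosen for the example.

```python
from statistics import mean, stdev

def is_abnormal(history, current, num_sigmas=3.0):
    """Return True if `current` deviates from the historical baseline.

    `history` holds measurements (for example roundtrip latency in ms)
    recorded while the network was behaving normally. The 3-sigma rule
    is an illustrative assumption, not a rule from the presentation.
    """
    if len(history) < 2:
        raise ValueError("need at least two baseline samples")
    baseline = mean(history)
    spread = stdev(history)
    return abs(current - baseline) > num_sigmas * spread

# Hypothetical roundtrip latencies collected during normal operation.
baseline_rtt_ms = [20.1, 19.8, 20.5, 21.0, 20.2, 19.9]
print(is_abnormal(baseline_rtt_ms, 20.7))  # close to the baseline
print(is_abnormal(baseline_rtt_ms, 45.0))  # far outside the baseline
```

Without the stored history, neither reading could be classified; with it, the comparison is a one-line check.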

What best practices address the challenges?
Challenges
• Complexity
• Change
• Triage
• Root Cause Analysis

Best Practices
• Getting in the path of the packets
• Capturing all the Packets
• Discovering Problems before the customer does
• Resolving Problems in a Timely Manner

Best Practices
• Getting in the Path of the Packets
– No Network Documentation
– Understanding Application Dependencies
– Tapping Technologies
– Virtual Machines
• Capturing all the Packets
– High Bandwidth Utilization
– What Happened Yesterday at 3pm?
• Discovering Problems before the Customer Does
– Network is the backbone for everything
– Automatically picking out problems from Gigabytes worth of data
• Resolving Problems in a Timely Manner
– Need better understanding of how applications work
– Remote offices

Getting in the Path of the Packets
• Why is this important?
– In order to analyze the application traffic and troubleshoot the problem, we must have the application packets in the capture buffer
– The only way to get these packets in the buffer is to get in the path of the packets between the client and server

Getting in the Path of the Packets
• Knowing the flow of the packets
– Oftentimes network administrators think they know the path packets are taking through the network
– In many cases, however, the packets are taking a different path, making the troubleshooting process much more difficult
– Without knowing the exact path, we cannot guarantee that we are in the path of the packets
– Not only must we know the Layer 3 flow of the packets, but we also need to know the Layer 2 flow of the packets
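
A minimal sketch of checking the assumed path against the observed one follows. The hop addresses are made up, and `compare_paths` is a hypothetical helper; in practice the observed list would come from a Layer 3 or Layer 2 traceroute.

```python
def compare_paths(documented, observed):
    """Compare the documented hop list with the hops actually observed.

    Returns (index, documented_hop, observed_hop) for every position
    where the two paths differ; a missing hop is reported as None.
    """
    differences = []
    for i in range(max(len(documented), len(observed))):
        doc_hop = documented[i] if i < len(documented) else None
        obs_hop = observed[i] if i < len(observed) else None
        if doc_hop != obs_hop:
            differences.append((i, doc_hop, obs_hop))
    return differences

# Hypothetical hop lists: what the documentation says vs. what was traced.
documented = ["10.0.0.1", "10.0.1.1", "10.0.2.1"]
observed = ["10.0.0.1", "10.0.9.1", "10.0.2.1"]
for index, doc_hop, obs_hop in compare_paths(documented, observed):
    print(f"hop {index}: documented {doc_hop}, observed {obs_hop}")
```

Any reported difference means a monitoring point placed according to the documentation may not be in the path of the packets.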

Demonstration of Layer 2 and Layer 3 Traceroute

Knowing Application Dependencies
• Why is this important?
– Unless we know the devices on which a client, server, service or application is dependent, we do not know which paths to monitor
– Once the conversations are documented, we can isolate the path between the two endpoints and connect the monitoring equipment
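
As an illustration of documenting conversations, the sketch below builds a dependency map from conversation records. The host names, ports, and both function names are hypothetical; a real list would be exported from an analyzer's host conversation view.

```python
from collections import defaultdict

def build_dependencies(conversations):
    """Map each host to the set of (peer, port) pairs it initiates to."""
    deps = defaultdict(set)
    for client, server, port in conversations:
        deps[client].add((server, port))
    return deps

def dependency_chain(deps, start):
    """List every host that `start` depends on, directly or indirectly."""
    seen, stack = [], [start]
    while stack:
        host = stack.pop()
        for peer, _port in sorted(deps.get(host, ())):
            if peer not in seen and peer != start:
                seen.append(peer)
                stack.append(peer)
    return seen

# Hypothetical conversation records: (client, server, server port).
conversations = [
    ("client-1", "web-1", 443),
    ("web-1", "app-1", 8080),
    ("app-1", "db-1", 1433),
    ("client-1", "dhcp-1", 67),
]
deps = build_dependencies(conversations)
print(dependency_chain(deps, "client-1"))
```

Each host in the resulting chain marks a path that may need a monitoring point.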

Host Conversations

[Diagram: data center with DHCP server, web servers, single sign-on server, application servers, and database servers]

OptiView Host Conversations Demonstration

Getting in the path
• Once we know the exact path of the packets, it is time to get into that path
• There are three common methods of getting in the path and capturing the packets
– Hub
– Span
– Tap

• Each of these methods has its own pros and cons

Hubs
• Pros
– Cheap
– Available
– Easy to install
• Cons
– Reduce link to half duplex
– May not be a true hub
– Not practical on servers or switch uplinks
– If power drops, link drops
– 10/100 Mbps speeds

Span/Mirror Ports
• Pros
– Free
– Available
– Does not require link to be dropped
– Great for one-time link monitoring
• Cons
– Requires switch access
– Configuration mistakes can result in network outages
– Can quickly become oversubscribed
– Requires a free switch port
[Figure: Catalyst 3550 switch front panel]

Taps
• Pros
– Truly monitors full-duplex traffic
– If power is lost, link stays active
– Can monitor gigabit links without packet loss
– Once installed, can stay
• Cons
– Most expensive option
– Have to break the link to install
– Can oversubscribe the monitor port and drop packets

Tap Deployment
• Analysis equipment can be quickly connected to the network, without the need for configuration changes
• Aggregators can be used to merge the traffic from multiple taps into a single stream
• This allows a single analyzer to monitor traffic at multiple locations as well as redundant paths
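
When several taps are aggregated, both directions of every tapped link land on one monitor port, so it is worth checking the arithmetic. The sketch below is illustrative only; the function name and the traffic figures are assumptions.

```python
def monitor_port_load(links_mbps, monitor_capacity_mbps):
    """Return (total offered load, True if the monitor port is oversubscribed).

    `links_mbps` is a list of (tx_mbps, rx_mbps) pairs, one per tapped
    full-duplex link. An aggregator merges both directions of every
    link into a single stream, so the loads add up.
    """
    total = sum(tx + rx for tx, rx in links_mbps)
    return total, total > monitor_capacity_mbps

# Two tapped gigabit links feeding one 1000 Mbps monitor port.
total, oversubscribed = monitor_port_load([(400, 350), (300, 200)], 1000)
print(total, oversubscribed)  # 1250 True
```

Here neither link is close to saturation, yet the merged stream exceeds the monitor port and packets would be dropped.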

Tap Deployment
• Having taps deployed at key locations provides easy access for the analysis equipment
• These points include:
– In front of server farms
– At the Internet connection
– Switch Uplinks
– Demarcation Points between Responsibilities

Tap Deployment

[Diagram: data center with DHCP server, web servers, single sign-on server, application servers, and database servers]

Capturing in a Virtual Machine Environment
• The use of virtual servers creates unique challenges when it comes to capturing packets
• There is no place to attach a physical tap
• The analyzer must be installed on the same virtual server as the virtual machines that are to be monitored
• Using the vSwitch within the virtual server, the traffic can be spanned from the virtual machines to the virtual analyzer machine

Capturing in a Virtual Machine Environment

NTM Connecting to the vSwitch

Capturing all the Packets
• Why is this important?
– Without all of the packets, we are not able to analyze the application traffic. Some examples are:
• VoIP Traffic – If we do not capture all of the traffic, the analyzer will report a lower Mean Opinion Score (MOS) due to packet loss
• TCP Traffic – The analyzer will report missing segments, which will give the appearance that packets are being lost on the network, when in fact they are not
– If the capture buffer is not big enough, the packets will roll out of the buffer before anyone knows the problem even occurred
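
How quickly packets roll out of a buffer is simple arithmetic, sketched below. The storage size and traffic rate are illustrative assumptions, and the calculation ignores per-packet storage overhead such as timestamps and index data.

```python
def retention_seconds(buffer_bytes, avg_rate_bps):
    """Seconds of traffic a capture buffer holds before old packets roll out."""
    if avg_rate_bps <= 0:
        raise ValueError("rate must be positive")
    return buffer_bytes * 8 / avg_rate_bps

# 4 TB of packet storage on a link averaging 2 Gbps (illustrative numbers).
seconds = retention_seconds(4e12, 2e9)
print(f"{seconds / 3600:.1f} hours")  # 4.4 hours
```

If problems are typically reported a day later, a buffer that holds only a few hours of traffic will no longer contain the evidence.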

High Performance Packet Capture
• In order to capture all of the packets, we must
– Have a hardware capture card that can keep up with the data rate of the network. This was easy in the 10/100 days; with the deployment of 10 Gigabit networks, it has become much more difficult
– Apply capture filters to the captured packets and discard those that do not match the filter
– Transfer the filtered packets to the storage system at a rate equal to the data rate of the network
– Index the captured traffic in such a way that it can be retrieved quickly by the protocol analyzer
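
To see why 10 Gigabit capture is so much harder than 10/100, consider the worst-case frame rate. The sketch below uses standard Ethernet framing figures (64-byte minimum frame, plus 8 bytes of preamble and 12 bytes of inter-frame gap on the wire); the function name is an assumption for illustration.

```python
def max_frames_per_second(link_bps, frame_bytes=64):
    """Worst-case Ethernet frame rate for a link at full line rate."""
    # Each frame also occupies 8 bytes of preamble/SFD and a 12-byte
    # inter-frame gap on the wire.
    wire_overhead_bytes = 20
    return link_bps / ((frame_bytes + wire_overhead_bytes) * 8)

print(round(max_frames_per_second(100e6)))  # 148810 frames/s at 100 Mbps
print(round(max_frames_per_second(10e9)))   # 14880952 frames/s at 10 Gbps
```

At 10 Gbps the capture path has roughly 67 nanoseconds per minimum-size frame to filter, timestamp, and store it.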

High Performance Packet Capture
1. Ethernet traffic is captured from multiple ports at full line rates by FPGA-based capture card – hardware filters supported
2. Entire frames are sent to the Packet Store repository for storage and post analysis
3. Entire frames are also sent to the various analytical and real-time monitoring engines that process, classify and index data – this information is stored in the metadata database
4. Atlas is the software interface that provides access to the rich network metadata information collected and created
5. For troubleshooting and in-depth network analysis, a packet view engine facilitates fundamental protocol and multi-segment flow analysis

10 Gig Packet Capture
• H/W filter & frame de-duplication
• Full Line Rate Capture with 2Gbps buffer
• Fast PCI-e bus

10Gbps Adapter Card (2*10G XFP)

1Gbps Adapter Card (4*1G SFP)

How Stream to Disk Works
• All NTMs use a RAID controller for high performance stream-to-disk
• All NTMs carry multiple disks to support multi-thread storage
• Large storage capacity models support RAID5 for redundancy
• All NTMs are specified with “true” capacity:
– True packet storage space
– Additional storage available for OS and metadata

Deployment of Stream to Disk
• Typically deployed in the data center, in front of application servers, database servers, VoIP servers
• For troubleshooting purposes, can be deployed in a portable fashion to capture traffic over long periods of time

Going Back in Time
• Oftentimes a problem occurs, but no one reports the problem until several hours or days later
• The ability to go back in time allows the network analyst to search through the captured packets quickly to extract those packets related to the problem
• The Network Time Machine provides the ability to select traffic by interface, time range, and device address
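
The kind of selection described above (interface, time range, device address) can be sketched against a list of stored packet records. This is a generic illustration, not the Network Time Machine's interface: the `PacketRecord` fields, the `select_packets` function, and the sample data are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class PacketRecord:
    timestamp: float   # seconds since the epoch
    interface: str     # capture interface name
    src: str           # source address
    dst: str           # destination address

def select_packets(records, interface, start, end, address):
    """Keep records on `interface`, within [start, end], involving `address`."""
    return [
        r for r in records
        if r.interface == interface
        and start <= r.timestamp <= end
        and address in (r.src, r.dst)
    ]

# Hypothetical stored records.
records = [
    PacketRecord(100.0, "port1", "10.1.1.5", "10.2.2.9"),
    PacketRecord(150.0, "port1", "10.2.2.9", "10.1.1.5"),
    PacketRecord(160.0, "port2", "10.1.1.5", "10.2.2.9"),
    PacketRecord(900.0, "port1", "10.1.1.5", "10.2.2.9"),
]
matches = select_packets(records, "port1", 90.0, 200.0, "10.1.1.5")
print(len(matches))  # 2
```

Narrowing by time and address first keeps the set of packets handed to the protocol analyzer small.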

Going Back in Time Demonstration

Add Filter to narrow down scope

Deployment of Stream to Disk
• Fixed Location – Data Center
– Server farms – Database Servers – Load Balancers

• Portable Solution
– Remote Offices

Capture to Disk Deployment

[Diagram: data center with DHCP server, web servers, single sign-on server, application servers, and database servers]

Discovering Problems before your Customer Does
• The network has become the backbone for most if not all of the communications within an organization
• These include
– E-mail
– Phone Traffic (VoIP)
– Video (Video Conferencing and Surveillance)
– Business Critical Applications
– Non-Business Critical Applications, but still important to the end user (Facebook)

• When these services are not performing well, the customer wants them fixed and fixed now!

Discovering Problems before your Customer Does
• Not so easy
• There are terabytes of information going across the network every day
• Most of this traffic is working properly; the trick is pulling out the traffic that is not working properly

Discovering Problems before your Customer Does
• Monitoring!!!!!!
• The customer should not be used as a monitoring device
• It is important to be able to discover network related problems before the customer discovers them
• To accomplish this, we need to be able to:
– Perform Real Time Analysis of the network traffic
– Take advantage of information available to us through SNMP, RMON, NetFlow
– Set monitoring thresholds and alarms to notify us when things are not performing as they should
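
A threshold check of the kind listed above can be very small. The sketch below is an assumption-laden illustration: the metric names, limits, and the `check_thresholds` function are hypothetical, and a real deployment would feed it values gathered from SNMP, RMON, NetFlow, or packet analysis.

```python
def check_thresholds(metrics, thresholds):
    """Return an alarm message for every metric above its threshold."""
    alarms = []
    for name, value in metrics.items():
        limit = thresholds.get(name)
        if limit is not None and value > limit:
            alarms.append(f"{name}: {value} exceeds {limit}")
    return alarms

# Hypothetical limits and one set of current readings.
thresholds = {"utilization_pct": 80, "fcs_errors": 0, "rtt_ms": 100}
current = {"utilization_pct": 92, "fcs_errors": 0, "rtt_ms": 35}
for alarm in check_thresholds(current, thresholds):
    print(alarm)  # utilization_pct: 92 exceeds 80
```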

Real Time Analysis
• It is important that the protocol analyzer be able to analyze packets not only after they have been captured, but as they are being captured
• This allows you to detect problems as they are occurring, instead of waiting until the customer reports a problem
• Detection can be combined with alerting, so that notifications are sent out when problems occur

Interface Utilization and Errors
• Packet loss and link congestion contribute to slow applications
• Eliminating these problems from the network will positively impact all of the applications running across the network
• Monitoring routers and switches using SNMP will allow you to quickly isolate those links experiencing high utilization and interface errors
• The OptiView Portable Network Analyzer provides the capability to collect these values and graph them in a useful fashion
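
Link utilization is usually derived from two readings of an interface octet counter, such as the IF-MIB `ifInOctets` object, divided by the link speed. The sketch below shows only the arithmetic; it does not poll a device, and the function name and sample values are assumptions.

```python
def utilization_percent(octets_start, octets_end, interval_s, if_speed_bps,
                        counter_bits=32):
    """Percent utilization from two samples of an octet counter.

    The modulo handles a single counter wrap between the two samples,
    which is common for 32-bit counters on fast links.
    """
    delta_octets = (octets_end - octets_start) % (2 ** counter_bits)
    return delta_octets * 8 / (interval_s * if_speed_bps) * 100

# 75,000,000 octets in 60 seconds on a 100 Mbps interface.
print(utilization_percent(1_000_000, 76_000_000, 60, 100e6))  # 10.0
```

On high-speed links a 32-bit counter can wrap more than once between polls, so shorter polling intervals or 64-bit counters are the safer choice.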

Interface Utilization and Errors
• FCS/CRC errors are a common problem on many networks
• These errors result in packet loss, which in turn results in the retransmission of packets
• Retransmission delays cause application delays
• A typical cause of FCS/CRC errors is a duplex mismatch
• The OptiView Portable Network Analyzer displays the number of errors seen on each port, thereby reducing the time it takes to isolate packet loss

Utilization and Interface Monitoring

[Diagram: data center with DHCP server, web servers, single sign-on server, application servers, and database servers]

Application Response Time
• If the infrastructure supporting the application is running slowly, then the application will run slowly
• By monitoring the time it takes to traverse the network and connect to the server, we are able to either implicate or eliminate the network as the cause of application slowdowns
• Monitoring these application ports over time will give us a baseline of the typical response time, which can then be compared with time periods when the application appears to be slow
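
One simple way to measure the time to traverse the network and connect to a server is to time a TCP connection setup. The sketch below uses only Python's standard `socket` module; `tcp_connect_time_ms` is a hypothetical helper, and the demonstration connects to a throwaway listener on the loopback interface rather than a real application server.

```python
import socket
import time

def tcp_connect_time_ms(host, port, timeout=3.0):
    """Time a TCP connection setup to host:port, in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        elapsed = time.perf_counter() - start
    return elapsed * 1000.0

# Demonstrate against a temporary listener on the loopback interface.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]
print(f"{tcp_connect_time_ms('127.0.0.1', port):.3f} ms")
listener.close()
```

Recording this value at regular intervals for each application port builds the baseline against which slow periods can be compared.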

Application Response Time Monitoring

[Diagram: data center with DHCP server, web servers, single sign-on server, application servers, and database servers]

Utilizing SNMP Data
• Virtually every device on the network has an SNMP agent
• These agents can provide information about the performance, utilization, and faults on the device
• This information includes:
– Host Resource Tables
– Route Tables
– ARP Caches

Host Resource Table
• SNMP enabled servers can be accessed with the OptiView Portable Network Analyzer
• From these servers we can pull information about:
– Memory and CPU utilization
– Running Processes
– Disk Utilization
– Number of Users

Host Resource Table Demonstration

Resolving Problems in a Timely Manner
• To minimize the impact of application problems on the client, it is important to resolve the problems in a timely manner
• Factors that reduce the amount of time necessary to resolve problems are:
– Understanding the application's dependencies, data flows, and response times
– Capturing in multiple locations and merging the packet captures to isolate packet loss and latency
– Playing back multimedia traffic to view the end user experience

Understanding Applications
• While the network analyst does not need to understand applications down to the code level, it is important to understand the network traffic related to applications
• This understanding will help reduce the amount of time it takes to troubleshoot the application
• A good practice is to capture the application traffic when the application is running well
• This good capture can be compared with the problem trace to reduce the amount of time it takes to isolate the problem

Application Centric Analysis
• Application Centric Analysis is the process of taking a top down approach to application troubleshooting as opposed to a bottom up approach
• If it can be shown that the network is transporting traffic as it should, we can begin troubleshooting the application by looking at data flows instead of packets
• This gives us a better picture of where the application may be failing, instead of digging through thousands of packets

What is a Transaction?
Business Transaction
• Purchase 100 shares of Danaher stock

User Action
• Go to Trade Page
• Look up Danaher Symbol
• Enter Symbol and Qty
• Submit Order

Application Transaction
• GET /tradepage.aspx
• GET /border.gif
• GET /dnarrow.gif
• GET /displayDHR.gif
• GET /stylesheet.css
• GET /javascript.js
• POST /submit_order.asp

Packets
• Packet #1 through Packet #16

Demonstration of Application Centric Analysis

Multi-Segment Analysis
• In order to get a complete picture of the problem, we may need to see both sides of the conversation at the same time
• By capturing on both sides and merging that traffic together, we are able to quickly identify the source of packet loss and delays
• To perform this multi-segment analysis, we must be able to synchronize the traces based on time stamp
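
The merge step can be sketched as matching the same packet in two captures and comparing timestamps. This is a simplified illustration, not ClearSight's algorithm: packets are assumed to carry a usable identifier (for example IP ID plus TCP sequence number), and the clock offset between the two analyzers is assumed to be known.

```python
def merge_segments(client_side, server_side, clock_offset_s=0.0):
    """Match packets seen at two capture points; report delay and loss.

    Each capture maps a packet identifier to its capture timestamp in
    seconds. `clock_offset_s` is the server-side clock minus the
    client-side clock, used to synchronize the two traces.
    """
    delays, lost = {}, []
    for packet_id, t_client in client_side.items():
        t_server = server_side.get(packet_id)
        if t_server is None:
            lost.append(packet_id)  # seen near the client, never arrived
        else:
            delays[packet_id] = (t_server - clock_offset_s) - t_client
    return delays, lost

# Hypothetical traces; the server-side clock runs 0.5 s ahead.
client_trace = {1: 10.000, 2: 10.010, 3: 10.020}
server_trace = {1: 10.502, 3: 10.560}
delays, lost = merge_segments(client_trace, server_trace, clock_offset_s=0.5)
print(lost)  # [2]
```

Packet 2 was lost between the two capture points, and the delay for packet 3 is much larger than for packet 1, which points at the segment between the analyzers rather than at the endpoints.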

Multi-Segment Analysis
• ClearSight merges trace files from both analyzers
[Diagram: Client, Network, Web Server, with OptiView]

Multi-Segment Analysis
• Firewall Latency
• Router Latency
• Core Latency

Multimedia Playback
• In some cases it takes more than just looking at packets to resolve an application problem
• When troubleshooting VoIP and Video problems, it is helpful to be able to play the media stream back to view the quality
• Problems such as echo with VoIP cannot be determined by looking at the statistics or packets. The only way to detect echo is to listen to the audio stream

Keys to Troubleshooting Multimedia
• Need to have the appropriate codecs available on the analysis equipment to play back media
• Measurement of Metrics
– MOS
– R-Factor
– V-Factor
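
MOS and R-Factor are related: the ITU-T G.107 E-model defines a standard mapping from an R-factor to an estimated MOS. The sketch below implements that mapping; how a particular analyzer computes its own R-factor or MOS is not covered here, and the function name is an assumption.

```python
def r_factor_to_mos(r):
    """Convert an E-model R-factor to an estimated MOS (ITU-T G.107 mapping)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6

print(f"{r_factor_to_mos(93.2):.2f}")  # 4.41
print(f"{r_factor_to_mos(70):.2f}")    # 3.60
```

The mapping is nonlinear, so a drop of a few R-factor points near 70 costs more perceived quality than the same drop near 90.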

Where to deploy the Equipment
• The placement of the analysis equipment has a significant impact on the analysis
• An analyzer placed close to the source of the multimedia traffic may not see the same problems as one placed near the destination

Portable Solution
• Having a portable analysis solution allows the analyst to move and connect to various locations to isolate the problem
• In cases of remote offices, the analysis solution can be shipped to the office to capture the end user experience

Use of Taps
• Having taps installed ahead of time provides a quick and easy way to connect the analyzer
• The use of taps ensures that the timing of the multimedia packets is not changed, which could adversely impact the metrics

VoIP Analyzer Deployment

[Diagram: data center with VoIP server, web servers, single sign-on server, application servers, and database servers]

Demonstration of Multimedia Playback

Summary of Best Practices and Challenges
Best Practice: Getting in the Path of the Packets
• Method: Flow of the packets; Application Dependencies; Span/Tap
• Fluke Networks Tools: OptiView – Traceswitch Route; OptiView – ICMP Traceroute; OptiView – Host Conversations; Fluke Networks Taps

Best Practice: Capturing All the Packets
• Method: High Performance Packet Capture; Capture to Disk – Back in Time
• Fluke Networks Tools: Network Time Machine; Network Time Machine

Best Practice: Discovering Problems before the Customer Does
• Method: Real Time Analysis; Using SNMP Data
• Fluke Networks Tools: Network Time Machine – Atlas Metrics; OptiView – Interface Utilization and Errors; OptiView – Application Response Time; OptiView – Host Resource Table

Best Practice: Diagnosing Problems in a Timely Manner
• Method: Understanding Applications; Multimedia Analysis
• Fluke Networks Tools: ClearSight Analyzer – Application Centric Analysis; ClearSight Analyzer – Multi-segment Analysis; Network Time Machine – High Performance Capture; ClearSight Analyzer – Multimedia Playback

Resources
• 90-Day ClearSight Trial – requires unique Proof of Purchase (POP) Code found on the ClearSight Flyer handed out at the seminar
• 14-Day ClearSight Trial – if you misplaced your POP Code you can download the 14-day trial at www.flukenetworks.com/csatrial
• Application-Centric Resource Center: www.flukenetworks.com/app-centric
• Network Forensics Resource Center: www.flukenetworks.com/ntmresources
• Portable Network Analysis: www.flukenetworks.com/optiview
• Request OptiView 5-Day Evaluation: www.flukenetworks.com/optivieweval

• For additional information: Email: info@flukenetworks.com. Phone 800-283-5853 (US/Canada) or 425-446-4519 (other locations).