The road to exceptional network performance management begins with a single step. With that in mind, VIAVI Solutions invites you to come along as Kevin Tolly takes you on an IT hero’s journey. From heightened awareness, to the battle call of a help ticket and revelation of network truths, uncover workflows and strategies for quick problem resolution.
- Part I: The Watchful Eye - Monitoring Vigilance
- Part II: A Call to Battle - Troubleshooting the Ticket
- Part III: Secrets Revealed - Identifying User Issues in the Packets
Learn to tell friend from foe, sniff out trouble, and harness the forgotten packets of the past to do what few network engineers have dared in this unique three-in-one webinar.
Learn more about Network Performance Monitoring solutions from VIAVI.
Network Performance Management Trilogy webinar transcript:
The point of this Network Performance Management webinar today is to impart information to you to help you make better decisions and understand how best to characterize your network. With that in mind we've broken this webinar into three pieces to be shared by Steve, Charles, and Kevin.
Steve, take it away.
Thanks, Brad. As Brad mentioned, we're going to really be exploring the journey of our IT Hero, the Network Team, and, really, the key challenges that you guys have around managing performance and user experience. And, as all good things come in threes, we're really going to be looking at that workflow of the network performance management, the different data sources that you can use on your journey as you're in these different phases, and, really, what works best and when to be using them. Then finally, more of a future-proof strategy for what you need in terms of managing and troubleshooting in terms of helping the end user and identifying performance issues rapidly, especially when you consider in the face of, as we're talking about the explosion of bandwidth, as we're seeing, and movements beyond 10 Gig.
In terms of what the goals of The Hero are, the network engineer, these are the things that you guys do day in, day out: it's the guaranteeing or dealing with end-user experience, ensuring network and application uptime availability and delivery, and, finally, whenever that phone rings or you get that notice or alert on your smart phone, handling these issues at fantastic speeds, because people are expecting their problems to be dealt with as soon as they say, "Hey. I've got a problem," even though it was an issue that occurred yesterday.
In terms of the way that we define that Trilogy of Network Performance Management, there are really three phases. There's the monitoring, that real-time tracking of what's going on on the network, across the enterprise.
Troubleshooting, being able to identify and triage that there is a problem and assigning the proper resources.
And then really doubling down in investigating that issue, identifying what the root cause is, resolving the issue.
At each one of these phases, there are different sources of data that you can go to find that answer whenever you're monitoring. We'll dive into those and really understand those phases.
The three data sources that we usually think of in powering the trilogy, powering network performance management, are Metadata or Statistics, System Logs, and then Packets or Deep Packet Inspection. As you can see mapped out against the phases of monitor, troubleshoot, and investigation, each one of these, while they're commonly used across all, definitely has strengths and weaknesses which really help you to determine when to use them.
Metadata is really strong in terms of monitoring, providing insight, and being able to triage for troubleshooting, but probably not the best to go to for actual deep dive investigation.
System logs, as you're actually getting into the investigation provide really good cursory evidence on who's doing what. And then DPI, as you're really getting into troubleshooting the issue or investigating that security issue, being able to offer full visibility in deep packet inspection to what's going on.
Steve, I think you hit it perfectly there. In reality, each one of these technologies can be leveraged, in part or in whole, for each one of the scenarios, whether it's for monitoring, for troubleshooting, or for the end all forensic investigation to say how did this event occur and then, of course, ensure that it doesn't occur in the future.
But, the key is using the right tool for the network performance management job. I oftentimes use analogies, and one of my favorite analogies here is, you could use a wrench to pound in a nail, and no judgment, I've pounded in many a nail with a crescent wrench in my day. But using a hammer's going to be more effective, more efficient, and get the job done better and faster. So, using the right data source for the right element of this overall trilogy or this journey of going from issue identification to problem resolution can make the overall job a lot easier. It can make the end product quite a bit better in definitively determining what happened, and why it happened, and then, of course, ensuring that it doesn't happen again.
I think I would agree with everything you said, Steve, that at the end of the day, these are all extraordinarily valuable items, and if you leverage each one of them to their most effective and applicable element of the journey, your end result is oftentimes much better than attempting to shoehorn in something that doesn't necessarily fit.
An analogy for thinking about those sources of data and when each one is used that I heard and I really liked at Cisco Live was the idea of if you were actually a guard monitoring physical security, metadata is going to be that alarm or alert that's going to go to your iPhone or to your pager to let you know there's been a breach or a break-in.
So, as you're running back, you're delving, and you're trying to fix what happened after that event occurred, you delve into the system logs to see what employee or what user badged in, what time did they badge, what was their employee number, and who was it? So at least you're getting some of that cursory evidence.
Then, packets, that deep packet inspection, really function more as those surveillance cameras in each room. Once you've figured out where they badged into, you can go look and see who was there, was it really the person that the badge says it was, what did they break into. You're getting those extra details to really understand and resolve the issue.
From here what we wanted to do was really look at each one of these: metadata, system logs, and packets, and really understand what are the strengths and limitations of each whenever you're going through these various phases.
Metadata as Charles talked about really providing that enterprise-wide view of what's going on right now, and also being able to trend off that in long-term for planning purposes, to be able to understand what's going on in my network and how am I going to need a long-term plan to be successful as well as providing those real-time alerts.
Yeah, there's really no better data source out there for doing capacity planning, long-term understanding of what your typical loads are within a given server. Or as you start to look things like migrating from traditional, on-premise, physical infrastructure data centers off to a Cloud-based environment, metadata can be just an invaluable source of information there to help you make educated, informed decisions on when to transition and how to transition from a traditional environment to one of the more evolved environments.
As Charles pointed out with those analogies, it's highly scalable. The challenge is because it's highly aggregated it's often also highly pruned, meaning that while you're able to identify poor network performance management symptoms, while you're able to plan out because of trending just inherently in that data you have gaps, so the question becomes what's it backed by?
That's typically why we say that you're able to find symptoms, but maybe not locate why those symptoms are occurring. You're able to find out what's happening, but not necessarily the why behind it, which is why you would use other forms of data to really complement it. That's where people are also turning to system logs, they're ubiquitous, they're all over the place, you're able to really corroborate from that user level, from that device level, what was going on in terms of performance.
The challenge with network performance management is that it’s high level and difficult to corroborate to see trends. But as you're getting deeper into root cause analysis that's really where we really see things.
Yeah, no mistake about it. System logs can be one of the most unique sources of information. As Steve pointed out they're ubiquitous. Take advantage of them, because they're coming out of all your servers, your load balancers, your routers, your switches, the firewalls, your security infrastructure, and so on.
That's the value of system logs. They are literally everywhere and, if you've got the appropriate environment setup to consume them, they are just an absolute wealth of information.
But I think Steve's security analogy was a good one earlier. A system log is going to point out, for example, that the following credential set was used to access a resource. Whether or not that was the appropriate user of that credential set, what they did once they were in the system, how they interacted, how they exchanged content, whether or not there was any malicious patterns to their activity is very difficult to extract from system logs, in most instances.
So, that's the upside and the downside right there. Ubiquitous, take advantage of them. They're out there, but at the same time understand that what you're looking at is a slightly different perspective of what transpired than what you might get if you had a video camera installed within the organization, and could record all the actual comings and goings, all the details of the activity that transpired.
And speaking of that video camera example source, that's really speaking to the strength of packet analytics or deep packet inspection. Having all of that evidence, having all of that network traffic literally there, you're able to really reconstruct, from a performance view, what exactly was happening.
From a network or an application standpoint, it really gives you that contextual why something was going on. That may have been drawn to your attention by the metadata that you had, or the high level of reporting that you had.
That being said, I think part of the network performance management challenge is that it can be easy to become inundated with packets, with network traffic, and hard to really make sense of which packets are relevant, what do I need to sift through or go to or filter to. That's really where it becomes incumbent upon the monitoring strategy that you have in place, as well as the tool selection that you make, in terms of are those solutions that you have in place going to enable you to successfully leverage all three of these resources.
Yeah, having come from two or three network support roles way back in my past, I can tell you there's no better feeling in the world than being able to definitively state that the issue is, in fact, not the network. But what it is, is a particular sequence that's happening if a particular resource in a given database table on a particular server within the farm is being fronted by a load balancer or a subset of users, et cetera, et cetera. Right?
Being able to definitively state what a problem is not only helps exonerate all the innocent elements of that, but it also helps you to easily say what action can we take to resolve this problem. There's just no other way to be able to definitively say what transpired and how it transpired and, frankly, how that sequence of events resulted in maybe a user experience degradation than using packets, and having full fidelity, full visibility to that communication.
As you'll see later, we'll talk about it a little bit, leveraging each one of these technologies in their most effective way will allow you to not only be able to get that high level alert, alarm, indication, of an issue existing, but then be able to help you further identify which packets are the most relevant, because, as you'll see as we go through the presentation today, packets are fantastic. The packets are wonderful. Why isn't everybody using them all the time? We're going to talk about that as we get into the presentation.
Yep. And that really does lead to that next spot. We're talking about the virtues and the unique resources that network traffic represents. However, if that's the case, why aren't more people using them? When we were on the Cisco show floor asking people, I would say it was probably about a third to 40% of the people were using packet analytics. There's a large portion that are just simply using metadata or log files, and that's how they're getting by.
The question becomes "Why aren't more people using them?" In talking, I think it's largely, as you're looking out there, that network performance management just simply hasn't kept pace with bandwidth, with traffic, with the ability to be able to capture everything. Because the volumes of traffic being what they are, and then looking at how they shifted when they're going to 10 Gig, network performance managers really shifted away from packets as a resource and really shifted more towards metadata.
The challenge with this is if you're being selective in terms of timeframe, in terms of granularity, you don't have everything. That just naturally creates within any sort of analysis gaps that exist. Or you're averaging things so strong details might be hidden. Or events might be missed completely.
That's really the challenge. The thing is that as we're looking to 40 Gig it's only going to really get worse in terms of what we're seeing in terms of gaps.
Yeah. It's a great point. I think it's one of the most relevant points to today's discussion, which is when you talk to experts out there and how they troubleshoot their network communication issues, overall performance problems, they undoubtedly will say, and 90-plus percent of the time, "I want to use packets every single time." That's the definitive answer that's going to get me clarity on the issue to its root cause most effectively and most efficiently.
But, unfortunately, that's not feasible within my environment, and so often that becomes a by-product of the volume of information. As Steve pointed out, as capacity for the network has grown and as bandwidth demands have continued to grow, network performance management vendors have sought other approaches to be able to obtain and provide root cause analysis.
But it comes at a price. Right? Nothing in life is free; everything has trade-offs associated with it. Metadata has the value of being able to provide high-level, overall aggregate trends of what's occurring within the environment. What it suffers from then is, as Steve pointed out earlier, highly aggregated, highly pruned, much of the detail is gone, lost forever, and oftentimes it's that detail it's ... unique nuances of a given conversation or a given exchange or even a given transaction, that can reveal what the root cause of a problem is, and can shorten your troubleshooting window from a few days down to a few minutes.
Hey, Charles. This is Kevin. Can I make a quick comment?
Your points are all good, and when you think about it, and you certainly as we're looking here at 10 Gig, and then, as you know, 40 Gig makes it worse, metadata's great. It takes and deals with some data ... The fact of the matter, and what you've been saying, I agree with, is at the end of the day, you really need 100%. Anything less than 100% might not be enough to find where your problem is. If you start pruning, "Well, we'll get 99%. Well, we'll get 90%."
Well, you know what? Unless you have 100% you can't be sure that the information that you need is in there. So, I think that's an important thing. What's good enough? Really it's got to be 100%.
Exactly. That's really insightful.
In terms of how this is really manifesting itself within the daily IT era, of the network engineer, really what we're seeing is there's a huge gap that exists. Annually we'll run what's called the State of the Network to really look at how emerging technologies, how troubleshooting is being impacted. One of the things that we always ask is, "What is the lead challenge?" Or "What's the greatest challenge that you have as a network professional, as an engineer, as a manager, as a head of IT in terms of managing applications?"
And what you'll see is we had a sample population of 1035 engineers, and the top thing that was cited by two-thirds of them was that root cause analysis, that their chief problem in managing applications is when they fail, where do I go? Do I go to the network, the system, or the application?
I think that that really speaks to that lack of having full fidelity forensic evidence of having the packets, of having the network traffic, and having all of this. Instead, they said they're left in the lurch. Really we see this growing larger only because the next question that we asked was, "How much do you expect your bandwidth demand for your organization's network to grow from its current level?"
What we're seeing is that well over 50% in that first year were saying that our network bandwidth is going to grow by up to 50%. So, a strong increase in them, likewise, that you'll see in that next, in 2018, is that 54%, the 10% plus the 44%, were saying that our bandwidth is going to grow by more than 50% in that second year. So, there's a constant increase. This is only going to exacerbate that issue of as I've got more traffic, as I have more things going on my network, more activities that I have to support as a network engineer, how am I going to determine what the root cause of problems are, especially when I have evidence that's missing?
... With that, one of the things that we recently announced in terms of the obvious solution going forward was in January, with this in mind, we came out with a whole new hardware platform aimed really at that full fidelity packet capture, namely in the form of GigaStor Generation 3. To market, that was the first network performance management and monitoring solution that provided the ability to capture basically a fully saturated 40 Gig line at line rate, encrypted, stream it to disk, and also have it available concurrently for analytics.
And that was the way that we've dealt with it, for the life of that appliance, we're able to do that without dropping any packet, as well as being able to quickly analyze off of that work load. What I wanted to do was just step back and take a higher look, and look at it from a monitoring strategy standpoint.
So, we've talked about the different forms of data. We've talked about where to utilize them. Now in terms of strategy, what do you need now, and certainly moving into the future to be able to troubleshoot and resolve issues.
So with that I'm going to turn it over to Charles Thompson to talk further.
That's great, Steve. I appreciate it.
Let's jump into a discussion here where we talk about the leverage of metadata and the packets.
... If you look at how you're going to effectively troubleshoot issues that occur within your environment, the first and foremost item is you need visibility. You need a wide purview of what's occurring within the environment, and you need that to be available at the rates at which your network communicates. Oftentimes that's fully saturated 10 Gig. More and more it's getting into 40 Gig, and, of course, on the horizon we have 100 Gig upcoming.
So, you need an offering that's going to be able to provide you a wide purview of what's occurring within the environment, but still provide a granularity of information that doesn't aggregate or prune too heavily such that it becomes meaningless trends and not actual transactions or events.
If you look at what you're going to need for a metadata network performance management solution, one-minute granularity is one of the first ones that always pops to my mind. Some products out there in the industry today look at things in five or ten minute increments. For long-term trends over months or even quarters or years, and if you're looking to do capacity planning and optimization, that type of five or ten-minute granularity can be quite valuable.
But when you're troubleshooting connections, conversations, events within the environment, getting granularity down to one minute means those bumps in the road, those individual potholes, the small, nuanced elements that occur can be more commonly brought to the surface and you can get visibility to what's occurring there.
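As a rough illustration of the granularity point, here is a minimal Python sketch. The response-time values are invented for the example and are not measurements from any product; the `rollup` helper is hypothetical. It averages the same ten one-minute samples at one-minute and at five-minute granularity to show how a short spike is diluted by the coarser roll-up.

```python
from statistics import mean

# Hypothetical per-minute response times in milliseconds (illustrative values only).
per_minute_ms = [100, 100, 100, 900, 100, 100, 100, 100, 100, 100]

def rollup(samples, bucket_size):
    """Average consecutive samples into buckets of `bucket_size` minutes."""
    return [
        mean(samples[i:i + bucket_size])
        for i in range(0, len(samples), bucket_size)
    ]

one_minute = rollup(per_minute_ms, 1)
five_minute = rollup(per_minute_ms, 5)

# At one-minute granularity the 900 ms pothole is plainly visible.
print("worst one-minute value:", max(one_minute))
# At five-minute granularity it is averaged together with four normal minutes.
print("worst five-minute value:", max(five_minute))
```

In this made-up data set the 900 ms minute shows up as a 260 ms five-minute average, which could easily sit below an alert threshold.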
The next is what I like to think of as logical out-of-the-box network performance management workflows that lead to resolution. As a long-term consumer of products within the space, I can tell you, there is nothing worse than trying to go into a tool and you've got a dashboard with 50 graphs, 30 tables, more columns than you can shake a stick at, and you have absolutely no idea what to do with any of it. You've got more data at your fingertips than you've ever had before, and yet you've got even less direction in where you should be going to solve the problem at hand.
The folks in Atlanta are having problems with voice over IP. A million data points are not going to help you solve that. A logical workflow that leads you to a point of resolution is going to help you solve that. So you need these out-of-the-box workflows such that you're able to just simply feed the product with information, with raw data from your network, and have it produce meaningful insights from the get go.
KPIs that make sense and are actionable. You can populate a network performance management dashboard with a million red light, green lights, but at the end of the day, if they don't tie back to actionable items that you can actually take action against, they don't really help you do your job. They don't really help you solve the problem. They don't really help your end users get a better experience with your communication infrastructure or with the services that IT is designed to provide to them.
Part of that has to do with having the right sets of KPIs. There's a lot of great network performance management vendors in this space that are going to provide those KPIs. VIAVI is absolutely one of them.
Key, however, is making sure those KPIs are derived from your network, your user base, what they're accustomed to. Using things like fixed static thresholds can be good if you're in an environment where you're trying to enforce an SLA, or service-level agreement, that you've purchased from a provider. Static thresholds, however, are not very good at highly dynamic environments because what they end up doing is, A, making it more difficult for the administrator of the system, you, to actually effectively leverage the system, but it also makes it difficult for the system to dynamically adapt to your environment.
I think of that as dynamic baselining. Understand what's normal for that time of day, that day of week, that time of the quarter on your network, and then compare what's happening now versus what usually happens for, again, that time of day, day of week. That's highly contextual insight that allows you to be able to determine whether or not 500 millisecond response times are good or bad. Should it be red or should it be green? It's highly relative, and it's extraordinarily personal, so it needs that context of what your users expect and what your network typically runs at.
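The baselining idea described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the history values are invented, the `is_anomalous` helper is hypothetical, and the "mean plus three standard deviations" rule is one common choice picked for the example, not a description of how any particular product computes its baselines.

```python
from statistics import mean, stdev

def is_anomalous(current_ms, history_ms, sigmas=3.0):
    """Return True if current_ms is more than `sigmas` standard deviations
    above the mean of the historical samples for the same time slot."""
    baseline = mean(history_ms)
    spread = stdev(history_ms)
    return current_ms > baseline + sigmas * spread

# Hypothetical response-time history (ms), keyed by (day of week, hour of day).
history_by_slot = {
    ("Mon", 9): [480, 510, 495, 505, 490],  # busy slot: around 500 ms is normal
    ("Sun", 3): [48, 52, 50, 49, 51],       # quiet slot: around 50 ms is normal
}

# The same 500 ms reading is judged against what is usual for each slot.
monday_morning = is_anomalous(500, history_by_slot[("Mon", 9)])
sunday_night = is_anomalous(500, history_by_slot[("Sun", 3)])
print("Mon 09:00, 500 ms anomalous?", monday_morning)
print("Sun 03:00, 500 ms anomalous?", sunday_night)
```

With this sample data, 500 ms is green on Monday morning and red in the small hours of Sunday, which is the contextual judgment a single static threshold cannot make.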
The last item here, and this one's near and dear to my heart, is the automated application map generation. If you're an organization that has large legacy applications there's a very good chance that nobody has a firm grasp on how those things truly work. We've all been victim of the undocumented changes. Right? Everyone in the organization has a map on their wall, they're all convinced that an application has five tiers, someone comes in and says, "Oh, no. That third tier, we got rid of that six months ago. That hasn't been there for a long time."
Or, "Well, I wish we could draw a map, but unfortunately all the folks who really knew how that application worked, they're not here anymore. They left." And now, you're tasked with helping the organization to transition from a traditional, big-iron, centralized data center environment out to a more modern, evolved, hyper-scale, dynamically managed Cloud-based environment, and you've got no idea how to move these applications around, how to migrate different elements of the application, what needs to be moved in tandem.
So, truly embracing Cloud is extraordinarily difficult when you don't have a good handle on how applications are built, and, of course, simply from a troubleshooting perspective, it's very difficult to say a user's having a problem with a large complex app, and not being able to say which tier of that application's having a problem.
One button, one-click, automated application mapping solves that in a heartbeat. Just 10 seconds after you identify the end user, you've got a full map of how that end user interfaces with the content at every tier of that application. So, that's a critical element of metadata.
But the key is that there's no amount of metadata that's going to always provide root cause analysis, and the reason for that is that metadata is, by definition, comprised of statistics, highly aggregated data sets that have been pruned and cut off to provide only the top-end values. So at the end of the day, metadata, and statistics in general, are great to provide you a trajectory. They're great to provide you a direction of investigation, and they help you to hone in on where problems will exist, or where problems do exist. But, oftentimes, they leave questions unanswered as to why the problems exist. They're a great smoke detector, but they're not an arson investigator, so they're not going to tell you why the problem occurred necessarily, they're going to give you indications that a network performance management problem occurred.
Highly valuable, absolutely believe 100% in it, but being able to get down to the how's and the why's are extraordinarily important. We used examples earlier on today's call where we talked a little bit about this idea of being able to go down to the granularity of individual transactions within a certain table, of the given database, on a particular server. That level of fidelity and granularity, you couldn't get it from metadata alone. What you could get from metadata is user A having a problem with server B with application C. That's fantastic and it provides a lot of great insight, but when you're tasked to actually help the team understand between user A and server B and application C, that's where the fidelity of packets comes into play, because packets provide you individual transactions, individual response codes, and context and payloads.
Zero Day Attacks are another great example of this. We got some fantastic statistics in our State of the Network study that showed just how often the average network engineer/network architect is spending time out of their day on security-related items. Zero day attacks are the most concerning for all security folks because zero day attacks oftentimes are not blocked by your perimeter security, and they're not identified by a lot of the IDS/IPS technologies out there. So you don't know that you've fallen victim to them until you've updated your security infrastructure, and that attack is no longer a zero day attack, it's now attack plus one, or attack plus three, or whatever it might be.
Being able to go back and investigate with absolute fidelity and 100% certainty how your environment was victimized by an attack, how widespread the attack was, what content may have been exposed during the attack. It is absolutely invaluable. It is the reason why every bank in the world, despite having locks on the doors, huge vaults, series and series of perimeter security, alarm systems, they also have security cameras. If the inevitable occurs, and it will, you have to have the ability to understand how it occurred, why it occurred, and, of course, what occurred with the event.
At the end of the day, that's what packets are all about. Packets don't lie. It's one of the favorite mantras of folks who do network performance management troubleshooting 24/7/365 for a living. Packets don't lie. There's nothing more telling about what occurred within the environment than packet content, because they allow you to reconstruct exactly what happened, to actually reassemble artifacts, like emails, and web pages, and phone calls, but also just to be able to go down to the level of being able to see an individual transaction between a given host and a given client. There's nothing else in our environment that will allow you to do that.
Now what we want to do is look at, as you're going forward, with your metadata, with packet capture, what are we going to need going out in terms of talking these higher bandwidth speeds, so as we're moving beyond 10 Gig, what exactly is it that we need? Or as we're moving into fully saturated 10 Gig environments, what do we need? What claims might you see out there in the industry, and what questions do you need to be asking as you're researching your monitoring strategy?
One of the most interesting areas, is at the end of the day, nobody wants to be left out of the conversation, nobody wants to be left out in the cold. 40 Gig is hot and it's an area of communication, an area of concentration for a lot of organizations, so there's a lot of claims out there around 40 Gig. A lot of organizations use certain tactics to be able to achieve what they call 40 Gig network analysis or network content analysis rates.
No harm, no foul. Understand where they come from on some of this stuff, but at the end of the day, if you're doing selective capture around only specific known events, if you're using slicing of content payloads, if you're sampling packets, then to Kevin's point from earlier, you don't have all the data all the time. You've got a small subset of that data. You're frankly not a whole lot better off than you would be if all you had was metadata. Unless you know up front which payloads are going to be interesting to you at some point in time in the future, which payloads are applicable, which packets are going to be relevant, which events are going to need deep forensic investigation?
Having some of the packets some of the time is just not acceptable. It's not acceptable when you're sitting in front of the board of directors trying to explain what happened during an event, and all you can answer is, "Well, I think it might have been this. Unfortunately, my network performance management tool had to sample those packets out because it couldn't keep up."
Or, "Unfortunately, it had to slice off that payload so that it could capture the headers."
The whole idea behind packet analysis is the deep forensic inspection, investigation that's capable when you've got every packet all the time.
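To illustrate why sampling and payload slicing can lose exactly the evidence you need, here is a small Python sketch. Everything in it is hypothetical: the "capture" is a list of synthetic 300-byte packets, the `ERROR 500` marker is an invented stand-in for the one interesting detail, and the helper names are made up for the example.

```python
# Synthetic capture: 100 packets of 300 zero bytes each (illustrative only).
packets = [bytes(300) for _ in range(100)]
# One packet carries the interesting detail 200 bytes into its payload.
packets[37] = bytes(200) + b"ERROR 500" + bytes(91)

def sample(pkts, every_n):
    """Keep one packet out of every `every_n` (simple 1-in-N sampling)."""
    return pkts[::every_n]

def slice_packets(pkts, snap_len):
    """Keep only the first `snap_len` bytes of each packet (payload slicing)."""
    return [p[:snap_len] for p in pkts]

def contains_marker(pkts, marker=b"ERROR 500"):
    """Return True if any stored packet still contains the marker."""
    return any(marker in p for p in pkts)

full_capture_has_it = contains_marker(packets)
sampled_has_it = contains_marker(sample(packets, 10))
sliced_has_it = contains_marker(slice_packets(packets, 128))

print("full capture:", full_capture_has_it)
print("1-in-10 sampling:", sampled_has_it)
print("128-byte slicing:", sliced_has_it)
```

Only the full capture still contains the marker. The sampled capture skipped packet 37 altogether, and the sliced capture kept the packet but cut it off before the relevant bytes.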
So, when you think about how you're going to go about instrumenting an environment for things like 40 Gig capture rates, there's three key dimensions that I think are worth keeping in mind. The first is the most commonly discussed, and that's bits per second, 40 gigabits per second. That's most easily understood; we all get that; we all know what that means. And that's great. The key is that under the covers in real world production environment situations, simple bits per second doesn't tell the whole story. You've also got to factor in things like the number of packets per second. If you're in a high frequency trading environment, if you've got a lot of individual orders that happen within your systems, if you've got a large voice environment, then you're dealing with a lot of small packets to fill that 40 Gig pipe.
You could have a 40 Gig pipe filled with, let's just say, for example, six million packets to get the 40 Gig, if they're large packets. Or you could have 20 million packets to fill that 40 Gig pipe, or 15 million packets to fill that 40 Gig pipe. So, understanding a packet per second capability starts to evolve the conversation quite a bit, and take it from a controlled, lab-tested, yes, we can strive to get to 40 Gig per second if the conditions are right to a real world production level type and kind of environment.
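To make the relationship between bits per second and packets per second concrete, here is a minimal sketch. It assumes standard Ethernet framing, where each frame carries 20 extra bytes on the wire (preamble, start-of-frame delimiter, and inter-frame gap), and the function name is illustrative. The packet counts quoted in the talk are round examples, so the computed values will not match them exactly.

```python
# Per-frame overhead on the wire that is not part of the frame itself:
# 7-byte preamble + 1-byte start-of-frame delimiter + 12-byte inter-frame gap.
ETHERNET_WIRE_OVERHEAD_BYTES = 20


def packets_per_second(line_rate_bps: float, frame_bytes: int) -> float:
    """Packets per second needed to fill a link with frames of a single size."""
    bits_on_wire = (frame_bytes + ETHERNET_WIRE_OVERHEAD_BYTES) * 8
    return line_rate_bps / bits_on_wire


if __name__ == "__main__":
    # Minimum-size, mid-size, and maximum-size untagged Ethernet frames.
    for size in (64, 512, 1518):
        pps = packets_per_second(40e9, size)
        print(f"{size:>5}-byte frames: {pps / 1e6:6.2f} million packets/s")
```

The same 40 Gig pipe needs roughly eighteen times as many packets per second when it is filled with minimum-size frames as when it is filled with maximum-size frames, which is why a bits-per-second figure alone says little about capture capability.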
In the final dimension there is unique IP addresses, and this one's critical, because in each system that generates metadata, or any system that's going to capture packets, and index those packets for fast retrieval is going to have to understand all the IP addresses that are communicating within the environment. Simply being able to say, "Sure, I can keep up with 40 Gig per second, as long as there's only 10 unique IP addresses on the network." That's great for a lab test. Right? It's great for a controlled environment. It is not real world, and it is not the situation that you're going to be putting these analysis tools into on your network.
Understanding not only the volume of traffic in terms of bits per second, the volume of packets that it takes to transport all those bits per second, and the volume of unique hosts or endpoints that are going to be involved in the communication are all critical dimensions as you start to think about, "How am I going to accomplish real 40 Gig per second production-level packet capture within my environments?"
With that what we want to do is turn it over to Kevin Tolly to run through the testing and the results as Tolly looks at the GigaStor platform for its ability to capture.
Thanks, Steve. Thanks, Charles. Let's just stay on this slide for a minute.
All of the discussion, the points that Steve and Charles made, lead up to this. At the end of the day, you need 100% of 100%, so it's not just saying yeah, we got every packet, but also, to Charles's point, one can say oh, I've got every packet, but only have a slice of the packet. That certainly makes the job a lot easier if you only have to get the first 64 or 128 bytes, but the information that you might need for your network forensics could be elsewhere in that packet that wasn't captured.
So, as we take a look at our network performance management project, and this is all documented in the report that I think is available through the handouts part of the control panel. Let me just give you a little bit of background; we're going to stay on this slide for a minute.
This is a project conducted by Tolly in May 2017, commissioned by VIAVI. The focus was on quantifying the multi-interface network traffic capture rates on both 10 Gig and 40 Gig, not just leaving it at 10 but going up to 40 Gig.
By capturing, we mean copying the traffic off of the network interfaces, and writing it to disc so the data could be used for network forensics and other tasks. The report can be found on Tolly's site and the VIAVI site, Report #217122.
The background for this is that, as pointed out by Steve and Charles, the rapid increases in network traffic volume, security threats, et cetera, underscored the need for lossless data capture, so that's what this project was all about. Again, I'm going to go quickly through this because Steve and Charles have made these points very nicely previously: you need 100% of 100%. That's it at the end of the day.
And to Charles's point, yes, it's very important to understand traffic capture two ways. Bits per second or megabits or gigabits per second is important, but also the packet rate that's being captured because, depending upon your network characteristics, you could have a high rate of traffic in terms of gigabits or megabits per second, a very low packet rate, or vice versa. You need to look at it both ways. You need to understand it both ways so you'll see in the upcoming slides that we did just that, we looked at it both ways.
The large number of IP addresses, you'll see that we addressed that, no pun intended. We used a million IP addresses, in 500 thousand pairs. Again, you'll see more of that in a minute.
Along with that, time is important, so we looked at rates at which the VIAVI solution could just capture and capture and capture for days or weeks or forever. Then we looked at a situation where there's a surge on the network: if you need to capture a particular event, what would be the surge or burst rate? We'll look at that coming up.
Then, as pointed out again, and Steve and Charles did a nice job of this, when you have packets, you have everything you need to do the forensics that are necessary. And when you have everything, you also have ... I don't know if it was said yet, but it's certainly a clear part of the strategy we'll go into ... you have sensitive data. If you have every bit of every email and every bit of every SQL transaction, you have sensitive data.
So, it's important not only that you have the data but that the data not fall into the wrong hands, even inside, let's say, a given organization. One of the things you'll see in a minute is that we did testing where the data was captured in encrypted format, such that we could remove the key and make that important data invisible to those who are not allowed to see it.
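As an illustration of what "a million IP addresses in 500 thousand pairs" means for a test plan, the sketch below generates endpoint pairs in which every address appears exactly once. The address ranges and the function name are placeholders chosen for the example; the webinar does not say which addresses the Tolly test used.

```python
import ipaddress


def endpoint_pairs(pair_count, client_base="10.0.0.0", server_base="172.16.0.0"):
    """Yield (client, server) pairs in which every address appears exactly once.

    The two base addresses are hypothetical and must start ranges that do
    not overlap for the requested pair count.
    """
    client_start = int(ipaddress.IPv4Address(client_base))
    server_start = int(ipaddress.IPv4Address(server_base))
    for offset in range(pair_count):
        yield (
            ipaddress.IPv4Address(client_start + offset),
            ipaddress.IPv4Address(server_start + offset),
        )


if __name__ == "__main__":
    pairs = list(endpoint_pairs(500_000))
    unique_endpoints = {address for pair in pairs for address in pair}
    print(len(pairs), "pairs,", len(unique_endpoints), "unique endpoints")
```

Every one of those endpoints is something a capture system has to track and index, which is why the endpoint count is treated as a test dimension in its own right.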
That's the high level view of what we did. We can probably move to the next slide now.
Thank you, Kevin.
As Kevin was mentioning, they focused on specific GigaStors. We have a variety in the GigaStor platform from software to hardware. The very first model that we focused on with the testing was the 288 terabyte GigaStor.
Excellent. Let's take a look at our test bed. Again, as was just noted, the testing was focused on the Observer GigaStor 288T. This was tested with two different physical interface configurations shown here. This and any of the figures that I reference in my presentation come directly out of the Tolly reports, so if you have trouble seeing any of this or you need to make notes on it, just download the report and look at it from there.
On the left the 288T is connected via two 40 Gig interfaces, and on the right via eight 10 Gig interfaces. That's 80 gigs in both tests, divided up by 40 or divided up by 10. In both cases, traffic is generated by Ixia devices and sent to the VIAVI device via an Ixia Packet Broker.
In terms of how we tested it, we looked at what the Ixia solution was sending and receiving, making sure that everything that was going in was coming out, which made sense since it was going through the broker. Then we took the statistics from that device and matched them up with the separate Observer GigaStor statistics to make sure that everything that was being generated was being accounted for.
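The reconciliation described above, comparing what the generator sent with what the capture appliance recorded, can be sketched as a simple counter comparison. The `Counters` type and the `capture_loss` function are hypothetical names for this example and do not correspond to any Ixia or Observer API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counters:
    """Packet and byte totals reported by one device for a test run."""
    packets: int
    bytes: int


def capture_loss(sent: Counters, captured: Counters) -> dict:
    """Compare generator totals with capture totals.

    Missing packets indicate dropped or sampled traffic. Missing bytes with
    no missing packets indicate that packets were sliced (truncated).
    """
    missing_packets = sent.packets - captured.packets
    missing_bytes = sent.bytes - captured.bytes
    return {
        "missing_packets": missing_packets,
        "missing_bytes": missing_bytes,
        "lossless": missing_packets == 0 and missing_bytes == 0,
    }


if __name__ == "__main__":
    full = capture_loss(Counters(1_000, 1_500_000), Counters(1_000, 1_500_000))
    sliced = capture_loss(Counters(1_000, 1_500_000), Counters(1_000, 128_000))
    print("full capture:", full)
    print("sliced capture:", sliced)
```

Checking bytes as well as packets is what separates "every packet" from "100% of 100%": a sliced capture can report every packet and still be missing most of the data.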
Next slide, please.
I'll spend most of my time on this slide, and it won't be a lot of time because everything is very, very succinct.
We talked about 10, we talked about 40, we tested both ways, as you saw in our prior slide. The results were identical. You can see the two rows here, so that's good. So, whether you're in a multi-10 Gig environment or a multi-40 Gig environment or, more likely, in the 10 Gig moving to 40 Gig, you can expect the exact same performance.
As was noted earlier, you need to be able to capture full packets. This is all full packets. You need to be able to capture regardless of how many IP endpoints are in your environment. So here we have for all these tests, one million, figuring that was pretty good for just about everybody.
We had 500 thousand unique pairs, or a million endpoints. As the note says on the bottom ... Again, if you're having any trouble seeing this, this is pulled right from our report. This is 100% capture of the entire packet. No slicing, no dicing, no filtering, so you've got everything there.
This set of tests was run two times. First they were run in just a standard environment, meaning standard hard discs, no encryption. Then we ran it again with AES-256 data-at-rest encryption enabled. Ran the whole thing, made sure that all of our write buffers and everything behaved exactly the same. It did.
We then further went on in that test to remove the keys and essentially close the disc, and verified that, in fact, the data was no longer available, so we showed that that functioned. But the key points here, to Charles's comments: the first column is the capture to disc, maximum sustained rate, so here we're looking at it in terms of larger packets, right, if you have the higher payload, and here we were seeing, consistently, running forever, 41.6 gigabits a second across the test environment. Whether it was 10 Gig, 8 times 10, or whether it was 2 times 40, it's there.
Then, looking at it the other way, and, again, as Charles pointed out, in situations where you have, let's say, high-speed trading or very high-speed small transaction environments, you might not be sending very large packets but you might be sending a lot of them per unit of time, per second, and you want to make sure that your capture device can keep up with the rate and not lose packets. We did that and the number we were able to get, as you can see here, 14.3 million packets a second, going on and on and on and on without any loss at all.
These are rates that, as long as you have enough discs, you can keep on going forever. These are good solid numbers that you can rely on when you're looking at managing your environment and having the appropriate level of support.
Finally we did a test looking at the maximum burst rate. What if we're looking at something just for 30 seconds, but we want to see what kind of rate we could capture at? Is it going to be the same? Is it going to be different? As you can see here, we were able to get up to 60 gigabits per second for a shorter amount of time, for 30 seconds. In fact it was a little bit more than 30 seconds, but we just rounded it down to 30 seconds for the sake of simplicity. But for a short period of time, 30, 35 seconds, with the 288 environment we were able to get up to 60 gigabits per second in terms of the full-capture environment. So, it's pretty impressive stuff.
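For a sense of scale, the following sketch converts a capture rate and a duration into a data volume, using the 60 gigabit per second burst figure and the 41.6 gigabit per second sustained figure from the test. It assumes decimal gigabytes (10^9 bytes), and the function name is illustrative. How the GigaStor accommodates the difference internally is not described in the webinar.

```python
def captured_gigabytes(rate_gbps: float, seconds: float) -> float:
    """Data volume in decimal gigabytes for a given rate held for a duration."""
    bits = rate_gbps * 1e9 * seconds
    return bits / 8 / 1e9


if __name__ == "__main__":
    burst = captured_gigabytes(60.0, 30)
    sustained = captured_gigabytes(41.6, 30)
    print(f"30 s at burst rate:     {burst:.1f} GB")
    print(f"30 s at sustained rate: {sustained:.1f} GB")
    print(f"difference:             {burst - sustained:.1f} GB")
```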
Again, this and all the surrounding details, the detailed methodology, et cetera, et cetera, are all found in the Tolly report that I referenced.
Okay, next slide.
As Kevin was explaining, you saw the spike rate of 60 gigabits per second. With all the labeling on the GigaStor product, as you can see, we try to label conservatively, so you know that you've got a little extra room, and that you have coverage should traffic go above and beyond what was stated.
The same held true in the next testing which you'll see is on the 192 terabyte unit.
I'll turn it back over to you, Kevin.
Here you can look again. I know we want to see if there are any questions, but here we go. As Steve and Charles have pointed out, it's a product line. It's not just one product. We focused on the 288T to see the absolute maximum, and then, for customers that have needs that are not quite as demanding, you can see here the 192T first, and then the 96T, both tested in the 8 times 10 Gig environment. Very impressive sustained data rates of 22.3 gigabits per second for the 192T, and 11.6 gigabits per second for the 96T. Again, when we say sustained, as far as we could tell from observing through our tests, this is sustained forever. So these are data rates that are not just for some brief amount of time, but the system's able to take these and write them to disc and keep them for you on an ongoing basis.
Next slide, please.
... Again, this is just information if you want to know exactly what we're doing. Version 17.2 is the same for all of this. You can see the exact model numbers that we looked at, and as I had said earlier, the OS environment and the capture cards are all here. I don't know if you can see it from the family slides that we've shown throughout, but this is all the same system, scaled appropriately in disc capacity, chassis, and port types, depending on the customer environment.
Next slide, please.
... Just to wrap it up, from the Tolly side, what did we see? We saw high performance on eight 10 gigabit ports and two 40 gigabit ports: 41.6 gigabits per second and 14.3 million pps on the 288T, with the 60 gigabit burst. Capture performance was the same when the target discs were encrypted; that's very, very important, so you don't lose anything when you decide that you want and need encryption.
As I understand it, and I asked the question, the encryption is an option. Maybe Steve or Charles can go into that. It's not that you have to decide up front. You can enable it later on and get a key, and your system will support both. Maybe that can be a nice seed question for the next slide. But it works very nicely and, from what we can observe through our testing, quite seamlessly.
Then the 192T and 96T, as shown on the prior slide, also have very impressive throughput rates, scaled for a little different system in terms of the size. On the right side of this screen you can see the cover page of our report, which we recommend that you read. It's available, I believe, through the control panel for this webinar, through VIAVI, and through Tolly.com.
And, Charles, Steve, back to you.
Okay, great, Kevin. I really appreciate that summary of the test report. You said it's available in a number of different locations per the slide that's indicated there.
We've already actually pulled a couple of questions for Charles and Steve here. I wanted to say that we do have a bit of time, so if you have a question, go ahead and type it in that chat window, and we'll try to get to it.
Meanwhile, Charles, perhaps you could start by talking about the AES-256 encryption of the GigaStor, and just how you go about enabling it and using that in your environment.
Absolutely. It's really one of the most interesting and unique feature sets of the GigaStor product family, that you have the ability to enable, on demand, AES-256-level, top secret-cleared encryption algorithms for all of the data that's being stored to disc. In terms of protecting that content, and being good stewards of the data that we're entrusted with, we retain that information in a highly secure way.
Now organizations can choose to not run that level of encryption, and not running it has some advantages. You don't have to enter a password when the system reboots in order to gain access to the disc. I guess that's one minor advantage to not leveraging the encryption.
But more and more frequently we see customers taking advantage of that capability that's included with the GigaStor, so it doesn't come at any cost or really require any substantive effort on anyone's behalf to take advantage of it. And it provides just so much peace of mind. As you look at all the different standards that are starting to occur, with HIPAA, the Payment Card Industry Data Security Standard, and, around the world, different policies, governmental regulations, and so on that are requiring secure data-at-rest content protection, this ability to protect the content as it sits on the disc of the GigaStor is just invaluable.
The fact that it has zero impact on the sustainable write-to-disc rate, with these boxes still exceeding that 40 Gig per second sustainable disc rate, is fantastic. It means you get the best of both worlds. You get true 40 Gig capture and retention, and you get security. There's not a trade-off space there.
Hey, Charles this is Kevin. Can I ask you a question?
If someone's new to this, someone's in an environment where the corporate management has said, "It's time for us to do this," because of HIPAA, or because whatever, "we need to be able to capture and save all this data." But they're basically a greenfield right now.
Are there any guidelines that you can provide to them in terms of trying to roughly size whether or not they're a 96T, 192T, or 288T kind of customer?
Yeah. A lot of it will have to do with where they intend to locate the appliance on their network. It's very, very easy to work with one of our sales consultants to have a conversation around how long your retention window needs to be, or how long you'd like it to be, and, based on your ingress/egress traffic rates, how much storage that's going to require. We've got calculators that we give out to folks that help them do some of that work themselves, or we'll do it in conjunction with them.
Generally speaking, on the edges of their networks, organizations tend to see smaller GigaStor appliances, because you've got less data traversing there, and lower rates. The core of the data center, or the core of their infrastructure, tends to see larger units that have longer retention windows and higher sustainable rates. So it does vary a little bit, but the great thing is that it's a science, not an art, and we've got all the information available here within VIAVI to be able to help organizations size appropriately.
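The core of that sizing conversation is simple arithmetic relating capacity, average traffic rate, and retention window. The sketch below is a rough approximation and not the VIAVI calculator: it assumes the full nominal capacity is available for packet data, uses decimal terabytes (10^12 bytes), and ignores index overhead and any compression. Both function names are illustrative.

```python
def retention_hours(capacity_tb: float, average_rate_gbps: float) -> float:
    """Hours of traffic that fit in the given capacity at an average rate."""
    capacity_bytes = capacity_tb * 1e12
    bytes_per_second = average_rate_gbps * 1e9 / 8
    return capacity_bytes / bytes_per_second / 3600


def required_capacity_tb(average_rate_gbps: float, hours_wanted: float) -> float:
    """Capacity in terabytes needed to retain traffic for a desired window."""
    bytes_per_second = average_rate_gbps * 1e9 / 8
    return bytes_per_second * hours_wanted * 3600 / 1e12


if __name__ == "__main__":
    # Worst case: the 288T written at its tested sustained maximum.
    print(f"288 TB at 41.6 Gbps: {retention_hours(288, 41.6):.1f} hours")
    # A hypothetical edge link averaging 2 Gbps with a three-day window.
    print(f"2 Gbps for 72 hours: {required_capacity_tb(2, 72):.1f} TB")
```

Under these assumptions, a 288 terabyte unit written continuously at 41.6 gigabits per second holds roughly 15 hours of traffic, and retention stretches proportionally as the average rate drops, which is why the placement of the appliance and the real average traffic rate drive the choice of model.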
Good. Thank you.
Thanks, Charles. We do have one question, Charles. We're approaching the top of the hour. Perhaps you can go through just an example of the workflow, the kind of relationship the metadata, the packet-level data, and how that whole process works.
Yeah, it's a great question. Things have evolved. Many years ago, everything was just packet-based, and realistically, you could use straight packet capture and decode in the memory of your laptop to do most of your network analysis.
Of course, things have evolved into leveraging more and more metadata all the time. One of the most common workflows that I see with our customers in this network performance management space is starting with metadata for what I call domain isolation. Imagine a scenario where you've got this sporadic issue occurring. You've got a few users within a given location that are experiencing a problem that no one else seems to be experiencing. The ability to go into a system, type in the IP address of those given end users or the application name that they're interfacing with, and perform domain isolation, to say, for example, "Is this issue related to the network, or is it related to the application?"
One line, one bar graph that has a portion of it filled with the network and a portion of it filled with the application. Clear, concise, easy to understand if the network is related to this issue or if, in fact, the issue is entirely contained within the application.
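The one-line bar described here can be approximated in a few lines. This sketch assumes the metadata already provides two numbers for the affected users, time attributed to the network and time attributed to the application; the function name and the text rendering are hypothetical and are not taken from the Observer product.

```python
def isolate_domain(network_ms: float, application_ms: float, width: int = 40) -> str:
    """Render a single text bar split between network (N) and application (A) time."""
    total = network_ms + application_ms
    if total <= 0:
        raise ValueError("total response time must be positive")
    network_share = network_ms / total
    network_cells = round(network_share * width)
    bar = "N" * network_cells + "A" * (width - network_cells)
    return f"[{bar}] network {network_share:.0%}, application {1 - network_share:.0%}"


if __name__ == "__main__":
    # Hypothetical measurement: 20 ms on the network, 180 ms in the application.
    print(isolate_domain(20, 180))
```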
But hypothesize just for a minute that it's in the application, that the information proves that out. We jump into scope and impact. Scope and impact gives us the ability to say, again, in metadata, whether the issue is being experienced on a given server that the application's hosted on, or across all of the servers that are hosting that application. That automatically gives us a lot of scope and impact information to understand if this may be an issue with the database. Or is this related to a given system?
Let's say, for example, that it's easily identified to be a specific server. Being able to then drill down and understand which users are experiencing that degradation can all be done inside of metadata. Where packets become valuable, as that next-step forensic investigation, is in extracting the contextual packets for that application on that server for that client within that time window. The metadata workflow gave us all that information, and it built a compound filter on the fly, on demand. After clicking on "extract relevant information," a few seconds later I get the few dozen packets that are directly related to that performance degradation, and I can see inside of those packets that it is a particular SQL SELECT command against a given table that, time and time again, is producing very poor responses for our end users.
A workflow like that is extraordinarily difficult to do with either metadata or packets alone. You truly need the combination of both in order to effectively and efficiently go from high-level issue identification through cause analysis and down to root cause and remediation.
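A compound filter of this kind can be pictured as the client, server, application port, and time window joined into one extraction request. The sketch below expresses the packet filter in BPF syntax, as used by tcpdump and libpcap, purely for illustration; it is not Observer's filter language, and the function, the addresses, and the port (1433, the default for Microsoft SQL Server) are hypothetical.

```python
import ipaddress
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ExtractionRequest:
    """Everything needed to pull the relevant packets from stored capture."""
    bpf_filter: str
    start: datetime
    end: datetime


def build_extraction(client, server, server_port, event_time, margin_seconds=30):
    """Combine client, server, port, and time window into one request."""
    # Validate the addresses so a typo fails here, not during extraction.
    client_ip = ipaddress.ip_address(client)
    server_ip = ipaddress.ip_address(server)
    if not 0 < server_port < 65536:
        raise ValueError("server_port must be between 1 and 65535")
    bpf = f"host {client_ip} and host {server_ip} and tcp port {server_port}"
    margin = timedelta(seconds=margin_seconds)
    return ExtractionRequest(bpf, event_time - margin, event_time + margin)


if __name__ == "__main__":
    request = build_extraction(
        "10.1.2.3", "10.9.8.7", 1433, datetime(2017, 5, 1, 12, 0, 0)
    )
    print(request.bpf_filter)
    print(request.start, "to", request.end)
```

The narrower the filter and the time window, the fewer packets have to be pulled from disc, which is how a workflow that starts in metadata can end with only a few dozen relevant packets.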
Great. Thank you, Charles. We're almost at the top of the hour. There were a couple of other questions. I think we'll follow up later to those people individually.
On behalf of the entire VIAVI organization, I want to thank you for attending, as well as Steve Brown, Charles Thompson, and Kevin Tolly for being here today. Again, we hope it was valuable for you. Happy network performance management.