Attendees:
Mauro Campanella (GARR)
Olav Kvittem (Uninett)
Simon Leinen (SWITCH)
Victor Reijs (HEAnet)
Nicolas Simar (DANTE)
Steve R. Williams (UKERNA)
The goal of the meeting is to share participants' experience with, and expectations of, performance monitoring.
Performance monitoring covers the following concepts: monitoring of delay, jitter, losses, etc.
The second part of the meeting aims to find out how to monitor these parameters.
NS:
- Interested in monitoring one-way delay (OWD), jitter, one-way packet loss (OWPL) and re-ordering, and in
correlating these measurements with traceroute.
- Topology: one monitoring box per major PoP inside a domain. Full mesh of measurements between boxes. Measurements
done for two classes of service (Premium and BE). Measurements of the accesses should also be done. Can the end-to-end
measurement be approximated as the sum of intra- and inter-domain measurements?
- Goals:
  - Premium IP SLA verification for users, NOC, planning
  - Network health (Premium and BE) for NOC, planning and NREN NOCs
  - Troubleshooting for NOC
  - Long term statistics for planning and NREN
  - Alarm generation (data mining and alarm correlation needed)
- GPS installation could be costly (EUR 2,000 for installation and EUR 1,500 per year).
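
To illustrate the question raised in the topology item (can an end-to-end measurement be approximated as the sum of intra- and inter-domain measurements?), here is a minimal Python sketch. The segment names and values are invented for illustration. Only mean one-way delays are summed; the loss combination assumes that losses on different segments are independent, which was not stated at the meeting.

```python
from math import prod


def end_to_end_mean_delay_ms(segment_means_ms):
    """Approximate the end-to-end mean one-way delay as the sum of segment means."""
    return sum(segment_means_ms)


def end_to_end_loss_ratio(segment_loss_ratios):
    """Combine per-segment loss ratios, ASSUMING losses are independent per segment."""
    return 1.0 - prod(1.0 - p for p in segment_loss_ratios)


# Hypothetical per-segment results: (name, mean OWD in ms, loss ratio)
segments = [
    ("domain A (intra)", 4.0, 0.001),
    ("A-B link (inter)", 6.5, 0.000),
    ("domain B (intra)", 3.5, 0.002),
]

delay = end_to_end_mean_delay_ms(m for _, m, _ in segments)
loss = end_to_end_loss_ratio(p for _, _, p in segments)
print(f"approx. end-to-end mean OWD: {delay:.1f} ms, loss ratio: {loss:.4f}")
```

Summing works for means; other statistics (percentiles, maxima, jitter) do not add up this simply and would need the per-packet data.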
VR:
- Active and passive monitoring.
- Network transport diagnosis metrics: path, BER, routing. Diagnosis should be done after each hardware
and software change, every day, and for debugging.
- Network service diagnosis: overprovisioning - link load; Premium IP - queue fullness, link load,
DSCP transparency on the path; LBE - DSCP transparency. Should be checked after each hardware or software change
and every hour; weathermap, trend.
- SLS metrics: one-way SLS: available bandwidth (+/- 1 Mbit/s), one-way packet loss (OWPL)
(+/- 1-order), loss pattern (-), IP packet delay variation (IPDV) (+/- 1 msec), delay (OWD) (+/- 1 msec). Should
be checked every 5 minutes; weathermap, trend.
- Application metrics (like MOS for audio and video, or certain GRID monitoring tools)
- Scope of measurement: inter- and intra-domain. End-to-end = concatenation of results, possible for averages.
- Should work on operating systems other than Unix (for end users). Should be upgradable.
- Secure access to data and infrastructure. Trusted exchange of data; standardised protocol/methodology.
- Cooperation needed with other groups (such as IMRG and the IETF IPPM working group)
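
As an illustration of how the SLS metrics listed above relate to each other, the following Python sketch derives OWD, loss ratio and IPDV from per-packet send and receive timestamps. It assumes synchronised clocks at both ends and takes IPDV as the delay difference between consecutive received packets; the function name and sample values are hypothetical.

```python
def one_way_metrics(send_ms, recv_ms):
    """Derive OWD samples, loss ratio and IPDV from per-packet timestamps.

    send_ms[i] and recv_ms[i] are timestamps in ms for packet i, taken on
    synchronised clocks (an assumption). recv_ms[i] is None for a lost packet.
    """
    delays = [r - s for s, r in zip(send_ms, recv_ms) if r is not None]
    lost = sum(1 for r in recv_ms if r is None)
    loss_ratio = lost / len(send_ms)
    # IPDV here: delay difference between consecutive *received* packets
    # (a simplification when a packet in between was lost).
    ipdv = [b - a for a, b in zip(delays, delays[1:])]
    return delays, loss_ratio, ipdv


# Hypothetical sample: 4 packets sent 10 ms apart, the third one lost.
delays, loss_ratio, ipdv = one_way_metrics([0, 10, 20, 30], [5, 16, None, 34])
print(delays, loss_ratio, ipdv)
```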
SW:
- UK particle physics is running some ping experiments and storing the data in an LDAP database.
- Welsh video-conferencing project running monitoring from MCU to end site with Cisco IPM. It sends bursts of
15 packets 20 ms apart. IPM needs a better database to analyse the results.
- Similar activity in UKERNA. They have two options for the database. First option: different tools send data to a
single database in real time, and viewers access the data in real time. Second option: various databases, plus a
single one collecting the info needed for xxx. They want to develop a database keeping in mind that it has to be
extended to other domains. Which information (common between domains) has to be in the database?
OK: providing data to the users should help to reduce the number of complaints to the NOC. It should be used to prove
that the problems are not in the network.
SL:
- Gave some information about SWITCH’s RIPE TTM experience (2 boxes in the network, TT 85 and 86). Difficult to run a
cable from the roof. Inter-domain, it is a problem to see results in real time because one can only access the
received data (not the sent data). More info is available, but only the next day, from the RIPE website.
- When the delay changes, an alarm is raised; most alarms are due to routing changes. The mail information is well
structured (once understood). It gives information and graphs about delay, jitter, IPDV and losses. Accounts can be
created. Delays are rock solid when routing is unchanged.
- What is the goal of such a tool: is it good value for money, or a marketing tool?
If sampling works, it is OK for 10 Gbps as well as for 100 Mbps. Sampling should also be part of the basic parameters.
VR: multicast should be part of the tool (at a later stage).
Tool should be IPv4 and IPv6 capable.
SL:
- It should make it possible to find the source of a performance problem. It should give a view of the path end
from the middle of the network.
- We have to be careful: end sites break traceroute with rate limiting (most important
QoS problem). Fragmentation is also a problem and not straightforward to spot.
- TraceQoS allows measurements on subpaths. It needs servers with good connectivity to the PoPs. The traceroute
gives the router ingress IP addresses. A flat ASCII file is used to match the router IP addresses to the servers.
The subpath tests are done in parallel and the results retrieved by one user.
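
The matching step described for TraceQoS can be sketched as follows. This is not TraceQoS code: the file layout (one "router-IP server-name" pair per line, with `#` comments), the function names, the addresses and the server names are all assumptions made for illustration.

```python
def parse_mapping(text):
    """Parse a flat ASCII mapping of router ingress IP -> measurement server.

    Hypothetical format: one 'router_ip server_name' pair per line,
    '#' starts a comment, blank lines are ignored.
    """
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        router_ip, server = line.split()
        mapping[router_ip] = server
    return mapping


def subpaths(traceroute_hops, mapping):
    """Return consecutive (server, server) pairs along the traced path."""
    servers = [mapping[ip] for ip in traceroute_hops if ip in mapping]
    return list(zip(servers, servers[1:]))


# Hypothetical mapping file and traceroute output (documentation addresses).
MAPPING_TEXT = """
# router ingress IP   server near that PoP
192.0.2.1    srv-a
192.0.2.9    srv-b
192.0.2.17   srv-c
"""
hops = ["192.0.2.1", "198.51.100.7", "192.0.2.9", "192.0.2.17"]
pairs = subpaths(hops, parse_mapping(MAPPING_TEXT))
print(pairs)
```

Each resulting pair is a subpath that could be tested independently, which is what allows the tests to run in parallel.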
Delay measurements are based on Poisson sampling if fewer than 100 packets are generated.
There is interest in having link stats and delay stats in parallel.
The database should be extended to show anything: queues, delays, etc.
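
For the Poisson-based delay measurement mentioned above, a minimal Python sketch of a send schedule is given below. It assumes that "Poisson" refers to Poisson sampling, i.e. probe packets sent with exponentially distributed gaps; the rate, packet count and function name are hypothetical.

```python
import random


def poisson_send_times(rate_per_s, count, seed=None):
    """Return `count` send times (seconds) forming a Poisson process.

    Gaps between probes are drawn from an exponential distribution with
    mean 1 / rate_per_s. The seed only makes the schedule reproducible.
    """
    rng = random.Random(seed)
    t = 0.0
    times = []
    for _ in range(count):
        t += rng.expovariate(rate_per_s)
        times.append(t)
    return times


# Hypothetical schedule: 50 probes at an average of 2 probes per second.
schedule = poisson_send_times(rate_per_s=2.0, count=50, seed=1)
print(f"first probe at {schedule[0]:.3f} s, last at {schedule[-1]:.3f} s")
```

Randomised gaps avoid synchronising the probes with periodic behaviour in the network, which a fixed-interval schedule could miss or over-sample.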
OK:
- Interest in having a weathermap at a European scale: Europe - country - region - …
- SCAMPI project, optical splitters at high speed.
- DoS attacks have signatures which can be spotted
- Delay measurement by hashing techniques
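
The minutes do not say which hashing technique was meant. One possible reading is passive delay measurement: the same packet is recognised at two observation points by hashing the parts of it that do not change in transit, and the two timestamps are subtracted. The Python sketch below illustrates that reading only; the function names and sample data are hypothetical, and synchronised clocks at both points are assumed.

```python
import hashlib


def packet_digest(invariant_bytes):
    """Hash the bytes of a packet that do not change along the path."""
    return hashlib.sha1(invariant_bytes).hexdigest()


def match_delays(observations_a, observations_b):
    """Match packets seen at point A and point B by digest.

    Each observation is (timestamp_ms, invariant_bytes). Returns the delay
    B - A for every packet seen at both points. Packets seen only at A are
    simply left unmatched (lost, or routed elsewhere).
    """
    seen_at_a = {}
    for ts, data in observations_a:
        seen_at_a.setdefault(packet_digest(data), ts)  # keep first occurrence
    delays = []
    for ts, data in observations_b:
        digest = packet_digest(data)
        if digest in seen_at_a:
            delays.append(ts - seen_at_a.pop(digest))
    return delays


# Hypothetical captures at two points; the third packet never reaches B.
point_a = [(0, b"pkt-1"), (10, b"pkt-2"), (20, b"pkt-3")]
point_b = [(7, b"pkt-1"), (18, b"pkt-2")]
matched = match_delays(point_a, point_b)
print(matched)
```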
MC: grids are mainly interested in bandwidth.
MC: at gigabit speed, the packet size is less and less important.
Database should have a general presentation.
SL: we should have standardised kinds of measurement data and define how to retrieve the data. We shouldn’t forget
data ownership and access control.
The database should be statistically based.
MC: at gigabit speed, we shouldn't care about the packet size. It becomes less and less important for delay.
Sampling is independent of the sampling factor.
SL: We should standardise the kinds of measured data we want to access, in web-service terminology. It should be
a front end. XML should be used to standardise access.
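
As an illustration of the XML suggestion above, a measurement result could be serialised along the following lines. No schema was agreed at the meeting: every element name, the PoP names and the values below are hypothetical, and the sketch uses only the Python standard library.

```python
import xml.etree.ElementTree as ET


def measurement_to_xml(source, destination, metric, value, unit, timestamp):
    """Serialise one measurement result as an XML string.

    Element names are hypothetical; they are not an agreed schema.
    """
    root = ET.Element("measurement")
    ET.SubElement(root, "source").text = source
    ET.SubElement(root, "destination").text = destination
    ET.SubElement(root, "metric").text = metric
    result = ET.SubElement(root, "value", unit=unit)
    result.text = str(value)
    ET.SubElement(root, "timestamp").text = timestamp
    return ET.tostring(root, encoding="unicode")


# Hypothetical one-way delay result between two monitoring boxes.
record = measurement_to_xml(
    "pop-a.example.net", "pop-b.example.net", "OWD", 12.4, "ms",
    "2002-01-01T12:00:00Z",
)
print(record)
```

A front end returning records of this kind would let each domain keep its own database while exposing a common format, with access control applied at the front end.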