Please use this identifier to cite or link to this item: https://hdl.handle.net/20.500.14279/13837
DC Field | Value | Language
dc.contributor.author | Outhred, Geoff | -
dc.contributor.author | Balakrishnan, Shobana | -
dc.contributor.author | Fitter, Percy | -
dc.contributor.author | Herodotou, Herodotos | -
dc.contributor.author | Ding, Bolin | -
dc.date.accessioned | 2019-05-31T06:39:51Z | -
dc.date.available | 2019-05-31T06:39:51Z | -
dc.date.issued | 2014-01-01 | -
dc.identifier.citation | 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2014; New York, NY; United States; 24 August 2014 through 27 August 2014 | en_US
dc.identifier.isbn | 978-145032956-9 | -
dc.description.abstract | Large-scale data center networks are complex, comprising several thousand network devices and several hundred thousand links, and form the critical infrastructure on which all higher-level services depend. Despite the built-in redundancy in data center networks, performance issues and device or link failures in the network can lead to user-perceived service interruptions. Therefore, determining and localizing user-impacting availability and performance issues in the network in near real time is crucial. Traditionally, both passive and active monitoring approaches have been used for failure localization. However, data from passive monitoring is often too noisy and does not effectively capture silent or gray failures, whereas active monitoring is potent in detecting faults but limited in its ability to isolate the exact fault location depending on its scale and granularity. Our key idea is to use statistical data mining techniques on large-scale active monitoring data to determine a ranked list of suspect causes, which we refine with passive monitoring signals. In particular, we compute a failure probability for devices and links in near real time using data from active monitoring, and look for statistically significant increases in the failure probability. We also correlate the probabilistic output with other failure signals from passive monitoring to increase the confidence of the probabilistic analysis. We have implemented our approach in the Windows Azure production environment and have validated its effectiveness in terms of localization accuracy, precision, and time to localization using known network incidents over the past three months. The correlated ranked list of devices and links is surfaced as a report that is used by network operators to investigate current issues and identify probable root causes. © 2014 ACM. | en_US
dc.format | pdf | en_US
dc.language.iso | en | en_US
dc.rights | © ACM | en_US
dc.subject | data center networks | en_US
dc.subject | failure localization | en_US
dc.title | Scalable near real-time failure localization of data center networks | en_US
dc.type | Conference Papers | en_US
dc.collaboration | Microsoft Research | en_US
dc.collaboration | Microsoft | en_US
dc.subject.category | Electrical Engineering - Electronic Engineering - Information Engineering | en_US
dc.country | United States | en_US
dc.subject.field | Engineering and Technology | en_US
dc.publication | Peer Reviewed | en_US
dc.relation.conference | International Conference on Knowledge Discovery and Data Mining | en_US
dc.identifier.doi | 10.1145/2623330.2623365 | en_US
dc.identifier.scopus | 2-s2.0-84907020098 | en
dc.identifier.url | https://api.elsevier.com/content/abstract/scopus_id/84907020098 | en
cut.common.academicyear | 2013-2014 | en_US
item.fulltext | No Fulltext | -
item.cerifentitytype | Publications | -
item.grantfulltext | none | -
item.openairecristype | http://purl.org/coar/resource_type/c_c94f | -
item.openairetype | conferenceObject | -
item.languageiso639-1 | en | -
crisitem.author.dept | Department of Electrical Engineering, Computer Engineering and Informatics | -
crisitem.author.faculty | Faculty of Engineering and Technology | -
crisitem.author.orcid | 0000-0002-8717-1691 | -
crisitem.author.parentorg | Faculty of Engineering and Technology | -
Appears in Collections: Δημοσιεύσεις σε συνέδρια / Conference papers or poster or presentation
Items in KTISIS are protected by copyright, with all rights reserved, unless otherwise indicated.