Please use this identifier to cite or link to this item: https://hdl.handle.net/20.500.14279/23112
DC Field | Value | Language
dc.contributor.author | Matic, Srdjan | -
dc.contributor.author | Iordanou, Costas | -
dc.contributor.author | Smaragdakis, Georgios | -
dc.contributor.author | Laoutaris, Nikolaos | -
dc.date.accessioned | 2021-09-24T09:36:59Z | -
dc.date.available | 2021-09-24T09:36:59Z | -
dc.date.issued | 2020-10-27 | -
dc.identifier.citation | ACM Internet Measurement Conference, 2020, 27–29 October | en_US
dc.identifier.isbn | 9781450381383 | -
dc.identifier.uri | https://hdl.handle.net/20.500.14279/23112 | -
dc.description.abstract | Several data protection laws include special provisions for protecting personal data relating to religion, health, sexual orientation, and other sensitive categories. Having a well-defined list of sensitive categories is sufficient for filing complaints manually, conducting investigations, and prosecuting cases in courts of law. Data protection laws, however, do not define explicitly what type of content falls under each sensitive category. Therefore, it is unclear how to implement proactive measures such as informing users, blocking trackers, and filing complaints automatically when users visit sensitive domains. To empower such use cases, we turn to the Curlie.org crowdsourced taxonomy project for drawing training data to build a text classifier for sensitive URLs. We demonstrate that our classifier can identify sensitive URLs with accuracy above 88%, and even recognize specific sensitive categories with accuracy above 90%. We then use our classifier to search for sensitive URLs in a corpus of 1 billion URLs collected by the Common Crawl project. We identify more than 155 million sensitive URLs in more than 4 million domains. Despite their sensitive nature, more than 30% of these URLs belong to domains that fail to use HTTPS. Also, in sensitive web pages with third-party cookies, 87% of the third parties set at least one persistent cookie. | en_US
dc.format | pdf | en_US
dc.language.iso | en | en_US
dc.relation.ispartof | ACM Internet Measurement Conference | en_US
dc.rights | © owner/author(s). | en_US
dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/4.0/ | *
dc.subject | Data privacy | en_US
dc.subject | HTTP | en_US
dc.subject | Websites | en_US
dc.title | Identifying Sensitive URLs at Web-Scale | en_US
dc.type | Conference Papers | en_US
dc.collaboration | TU Berlin | en_US
dc.collaboration | Cyprus University of Technology | en_US
dc.collaboration | IMDEA Networks Institute | en_US
dc.subject.category | Computer and Information Sciences | en_US
dc.country | Germany | en_US
dc.country | Cyprus | en_US
dc.subject.field | Natural Sciences | en_US
dc.publication | Peer Reviewed | en_US
dc.identifier.doi | 10.1145/3419394.3423653 | en_US
dc.identifier.scopus | 2-s2.0-85097286610 | -
dc.identifier.url | https://api.elsevier.com/content/abstract/scopus_id/85097286610 | -
cut.common.academicyear | 2020-2021 | en_US
item.grantfulltext | open | -
item.languageiso639-1 | en | -
item.cerifentitytype | Publications | -
item.openairecristype | http://purl.org/coar/resource_type/c_c94f | -
item.openairetype | conferenceObject | -
item.fulltext | With Fulltext | -
crisitem.author.dept | Department of Electrical Engineering, Computer Engineering and Informatics | -
crisitem.author.faculty | Faculty of Engineering and Technology | -
crisitem.author.parentorg | Faculty of Engineering and Technology | -
Appears in Collections: Δημοσιεύσεις σε συνέδρια / Conference papers or poster or presentation
Files in This Item:
File | Description | Size | Format
3419394.3423653.pdf | Fulltext | 497.22 kB | Adobe PDF

Scopus citations: 11 (checked on Mar 14, 2024)
Page views: 270 (last week: 0; last month: 3; checked on Nov 6, 2024)
Downloads: 862 (checked on Nov 6, 2024)

This item is licensed under a Creative Commons License.