Identifying Sensitive URLs at Web-Scale

Matic, Srdjan; Iordanou, Costas; Smaragdakis, Georgios; Laoutaris, Nikolaos

doi:10.1145/3419394.3423653

Please use this identifier to cite or link to this item: https://hdl.handle.net/20.500.14279/23112

DC Field	Value	Language
dc.contributor.author	Matic, Srdjan	-
dc.contributor.author	Iordanou, Costas	-
dc.contributor.author	Smaragdakis, Georgios	-
dc.contributor.author	Laoutaris, Nikolaos	-
dc.date.accessioned	2021-09-24T09:36:59Z	-
dc.date.available	2021-09-24T09:36:59Z	-
dc.date.issued	2020-10-27	-
dc.identifier.citation	ACM Internet Measurement Conference, 2020, 27–29 October	en_US
dc.identifier.isbn	9781450381383	-
dc.identifier.uri	https://hdl.handle.net/20.500.14279/23112	-
dc.description.abstract	Several data protection laws include special provisions for protecting personal data relating to religion, health, sexual orientation, and other sensitive categories. Having a well-defined list of sensitive categories is sufficient for filing complaints manually, conducting investigations, and prosecuting cases in courts of law. Data protection laws, however, do not define explicitly what type of content falls under each sensitive category. Therefore, it is unclear how to implement proactive measures such as informing users, blocking trackers, and filing complaints automatically when users visit sensitive domains. To empower such use cases we turn to the Curlie.org crowdsourced taxonomy project for drawing training data to build a text classifier for sensitive URLs. We demonstrate that our classifier can identify sensitive URLs with accuracy above 88%, and even recognize specific sensitive categories with accuracy above 90%. We then use our classifier to search for sensitive URLs in a corpus of 1 billion URLs collected by the Common Crawl project. We identify more than 155 million sensitive URLs in more than 4 million domains. Despite their sensitive nature, more than 30% of these URLs belong to domains that fail to use HTTPS. Also, in sensitive web pages with third-party cookies, 87% of the third parties set at least one persistent cookie.	en_US
dc.format	pdf	en_US
dc.language.iso	en	en_US
dc.relation.ispartof	ACM Internet Measurement Conference	en_US
dc.rights	© owner/author(s).	en_US
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/4.0/	*
dc.subject	Data privacy	en_US
dc.subject	HTTP	en_US
dc.subject	Websites	en_US
dc.title	Identifying Sensitive URLs at Web-Scale	en_US
dc.type	Conference Papers	en_US
dc.collaboration	TU Berlin	en_US
dc.collaboration	Cyprus University of Technology	en_US
dc.collaboration	IMDEA Networks Institute	en_US
dc.subject.category	Computer and Information Sciences	en_US
dc.country	Germany	en_US
dc.country	Cyprus	en_US
dc.subject.field	Natural Sciences	en_US
dc.publication	Peer Reviewed	en_US
dc.identifier.doi	10.1145/3419394.3423653	en_US
dc.identifier.scopus	2-s2.0-85097286610	-
dc.identifier.url	https://api.elsevier.com/content/abstract/scopus_id/85097286610	-
cut.common.academicyear	2020-2021	en_US
item.grantfulltext	open	-
item.languageiso639-1	en	-
item.cerifentitytype	Publications	-
item.openairecristype	http://purl.org/coar/resource_type/c_c94f	-
item.openairetype	conferenceObject	-
item.fulltext	With Fulltext	-
crisitem.author.dept	Department of Electrical Engineering, Computer Engineering and Informatics	-
crisitem.author.faculty	Faculty of Engineering and Technology	-
crisitem.author.parentorg	Faculty of Engineering and Technology	-
Appears in Collections:	Δημοσιεύσεις σε συνέδρια / Conference papers or poster or presentation

Files in This Item:

File	Description	Size	Format
3419394.3423653.pdf	Fulltext	497.22 kB	Adobe PDF

This item is licensed under a Creative Commons License