Please use this identifier to cite or link to this item:
https://hdl.handle.net/20.500.14279/23112
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Matic, Srdjan | - |
dc.contributor.author | Iordanou, Costas | - |
dc.contributor.author | Smaragdakis, Georgios | - |
dc.contributor.author | Laoutaris, Nikolaos | - |
dc.date.accessioned | 2021-09-24T09:36:59Z | - |
dc.date.available | 2021-09-24T09:36:59Z | - |
dc.date.issued | 2020-10-27 | - |
dc.identifier.citation | ACM Internet Measurement Conference, 2020, 27–29 October | en_US |
dc.identifier.isbn | 9781450381383 | - |
dc.identifier.uri | https://hdl.handle.net/20.500.14279/23112 | - |
dc.description.abstract | Several data protection laws include special provisions for protecting personal data relating to religion, health, sexual orientation, and other sensitive categories. Having a well-defined list of sensitive categories is sufficient for filing complaints manually, conducting investigations, and prosecuting cases in courts of law. Data protection laws, however, do not define explicitly what type of content falls under each sensitive category. Therefore, it is unclear how to implement proactive measures such as informing users, blocking trackers, and filing complaints automatically when users visit sensitive domains. To empower such use cases, we turn to the Curlie.org crowdsourced taxonomy project for drawing training data to build a text classifier for sensitive URLs. We demonstrate that our classifier can identify sensitive URLs with accuracy above 88%, and even recognize specific sensitive categories with accuracy above 90%. We then use our classifier to search for sensitive URLs in a corpus of 1 billion URLs collected by the Common Crawl project. We identify more than 155 million sensitive URLs in more than 4 million domains. Despite their sensitive nature, more than 30% of these URLs belong to domains that fail to use HTTPS. Also, in sensitive web pages with third-party cookies, 87% of the third parties set at least one persistent cookie. | en_US |
dc.format | en_US | |
dc.language.iso | en | en_US |
dc.relation.ispartof | ACM Internet Measurement Conference | en_US |
dc.rights | © owner/author(s). | en_US |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/4.0/ | * |
dc.subject | Data privacy | en_US |
dc.subject | HTTP | en_US |
dc.subject | Websites | en_US |
dc.title | Identifying Sensitive URLs at Web-Scale | en_US |
dc.type | Conference Papers | en_US |
dc.collaboration | TU Berlin | en_US |
dc.collaboration | Cyprus University of Technology | en_US |
dc.collaboration | IMDEA Networks Institute | en_US |
dc.subject.category | Computer and Information Sciences | en_US |
dc.country | Germany | en_US |
dc.country | Cyprus | en_US |
dc.subject.field | Natural Sciences | en_US |
dc.publication | Peer Reviewed | en_US |
dc.identifier.doi | 10.1145/3419394.3423653 | en_US |
dc.identifier.scopus | 2-s2.0-85097286610 | - |
dc.identifier.url | https://api.elsevier.com/content/abstract/scopus_id/85097286610 | - |
cut.common.academicyear | 2020-2021 | en_US |
item.grantfulltext | open | - |
item.languageiso639-1 | en | - |
item.cerifentitytype | Publications | - |
item.openairecristype | http://purl.org/coar/resource_type/c_c94f | - |
item.openairetype | conferenceObject | - |
item.fulltext | With Fulltext | - |
crisitem.author.dept | Department of Electrical Engineering, Computer Engineering and Informatics | - |
crisitem.author.faculty | Faculty of Engineering and Technology | - |
crisitem.author.parentorg | Faculty of Engineering and Technology | - |
Appears in Collections: | Δημοσιεύσεις σε συνέδρια /Conference papers or poster or presentation |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
3419394.3423653.pdf | Fulltext | 497.22 kB | Adobe PDF | View/Open |
This item is licensed under a Creative Commons License