Identifying Sensitive URLs at Web-Scale

Journal
ACM Internet Measurement Conference
Date Issued
October 27, 2020
Author(s)
Matic, Srdjan  
Iordanou, Costas  
Smaragdakis, Georgios  
Laoutaris, Nikolaos  
DOI
10.1145/3419394.3423653
Abstract
Several data protection laws include special provisions for protecting personal data relating to religion, health, sexual orientation, and other sensitive categories. Having a well-defined list of sensitive categories is sufficient for filing complaints manually, conducting investigations, and prosecuting cases in courts of law. Data protection laws, however, do not define explicitly what type of content falls under each sensitive category. Therefore, it is unclear how to implement proactive measures such as informing users, blocking trackers, and filing complaints automatically when users visit sensitive domains. To enable such use cases, we turn to the Curlie.org crowdsourced taxonomy project to draw training data for building a text classifier for sensitive URLs. We demonstrate that our classifier can identify sensitive URLs with accuracy above 88%, and even recognize specific sensitive categories with accuracy above 90%. We then use our classifier to search for sensitive URLs in a corpus of 1 billion URLs collected by the Common Crawl project. We identify more than 155 million sensitive URLs in more than 4 million domains. Despite their sensitive nature, more than 30% of these URLs belong to domains that fail to use HTTPS. Also, in sensitive web pages with third-party cookies, 87% of the third parties set at least one persistent cookie.
Subjects
Data privacy
HTTP
Websites
File(s)
Name
3419394.3423653.pdf
Size
497.22 KB
Format
Adobe PDF
Checksum (MD5)
7e1ee4d9ac4087ac0e514640c80f0b88
