You Are Who You Appear to Be:
A Longitudinal Study of Domain
Impersonation in TLS Certificates

Richard Roberts*, Yaelle Goldschlag*, Rachel Walter*,
Taejoong Chung, Alan Mislove, Dave Levin*

*University of Maryland, Rochester Institute of Technology, Northeastern University

Paper Overview

Abstract

The public key infrastructure (PKI) provides the fundamental property of authentication: the means by which users can know with whom they are communicating online. The PKI ensures end-to-end authenticity insofar as it verifies a chain of certificates, but the true final step in end-to-end authentication comes when the user verifies that the website is what they expect. To this end, users are expected to evaluate domain names, but various “domain impersonation” attacks threaten their ability to do so. Indeed, if a user could be easily tricked into believing that amazon.com-offers.com is actually amazon.com, then, coupled with security indicators like a lock icon, users could believe that they have a secure connection to Amazon.

We study this threat to end-to-end authentication: (1) We introduce a new classification of an impersonation attack that we call target embedding. This embeds an entire target domain, unmodified, using one or more subdomains of the actual domain. (2) We perform a user study with the specific goal of understanding whether users fall for target embedding, and how its efficacy compares to other popular impersonation attacks (typosquatting, combosquatting, and homographs). We find that target embedding is the most effective against modern browsers. (3) Using all HTTPS certificates collected by Censys, we perform a longitudinal analysis of how target-embedding impersonation has evolved, who is responsible for issuing impersonating certificates, who hosts the domains, where the economic choke-points are, and more. We close with a discussion of counter-measures against this growing threat.

What is "Target Embedding"?

A wide range of domain impersonation attacks have been identified. These include: typosquatting, in which the impersonating domain has a small edit distance from the target domain (faceboook.com); bitsquatting, in which a bit in the ASCII representation is flipped (fagebook.com); combosquatting, in which the attacker includes a target’s brand name alongside other string tokens (facebooklogin.com); homographs, which use “confusable” characters (often Unicode characters used in Internationalized Domain Names, or IDNs) that look like the real characters (faceb00k.com); and homophones, which are domains that sound the same as a target domain when read aloud (fasebook.com). All of these impersonation attacks occur in the effective second-level domain, or e2LD (e.g., example in example.com).
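To make the first three categories concrete, here is a minimal Python sketch that checks the examples above against the target "facebook". The function names and the exact matching rules are illustrative assumptions, not the definitions used in the paper; real detectors also handle the TLD, IDN normalization, and brand lists.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def is_bitsquat(candidate: str, target: str) -> bool:
    """True if the strings differ in exactly one character by exactly one bit."""
    if len(candidate) != len(target):
        return False
    diffs = [ord(c) ^ ord(t) for c, t in zip(candidate, target) if c != t]
    return len(diffs) == 1 and bin(diffs[0]).count("1") == 1


def is_combosquat(candidate_e2ld: str, brand: str) -> bool:
    """Assumed rule: the brand appears verbatim next to other tokens."""
    return brand in candidate_e2ld and candidate_e2ld != brand


if __name__ == "__main__":
    print(edit_distance("faceboook", "facebook"))
    print(is_bitsquat("fagebook", "facebook"))
    print(is_combosquat("facebooklogin", "facebook"))
```

Note that all three checks only ever look at the e2LD label, which is exactly the restriction that target embedding, described next, does away with.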

We expand on this work by introducing a type of impersonation that we call target embedding. Simply put, a target embedding domain embeds a complete, unmodified target domain, including the TLD, by using one or more subdomains of the real domain. The target domain is separated from the rest of the domain on the right (and optionally on the left) by either a period (.) or a hyphen (-). For example, consider the target embedding domain “www.facebook.com.user-29de84ca4bfa72.tk”. The target, in this case “facebook.com”, is embedded using subdomains of the actual domain, “user-29de84ca4bfa72.tk”. The target’s TLD can also appear in the real e2LD, such as apple.com-login.pw. Unlike prior domain impersonation attacks, target embedding does not operate strictly within the e2LD: in fact, it requires the use of at least one subdomain, as all target domains have at least one period between their e2LD and TLD.
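The definition above can be sketched as a simple string check. This is not the authors' detection code; it is a minimal illustration under two assumptions: (a) a domain that equals the target or is an ordinary subdomain of it is legitimate, and (b) "optionally on the left" means the target either starts the domain or is preceded by a period or hyphen. The function name `is_target_embedding` is hypothetical.

```python
SEPARATORS = ".-"  # the two separators named in the definition


def is_target_embedding(domain: str, target: str) -> bool:
    """Return True if `domain` embeds the full, unmodified `target` domain.

    Assumption: the target must be followed by '.' or '-', and must either
    start the domain or be preceded by '.' or '-'.
    """
    domain = domain.lower().rstrip(".")
    target = target.lower().rstrip(".")

    # The target itself and its genuine subdomains are not impersonation.
    if domain == target or domain.endswith("." + target):
        return False

    start = domain.find(target)
    while start != -1:
        end = start + len(target)
        left_ok = start == 0 or domain[start - 1] in SEPARATORS
        right_ok = end < len(domain) and domain[end] in SEPARATORS
        if left_ok and right_ok:
            return True
        start = domain.find(target, start + 1)
    return False


if __name__ == "__main__":
    for name, target in [
        ("www.facebook.com.user-29de84ca4bfa72.tk", "facebook.com"),
        ("apple.com-login.pw", "apple.com"),
        ("www.facebook.com", "facebook.com"),
    ]:
        print(name, is_target_embedding(name, target))
```

A production check would additionally need a list of target domains and the Public Suffix List to identify the actual e2LD; this sketch deliberately avoids both.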

To motivate our study of target embedding, we performed a user study with a single goal: to understand how readily users fall for target embedding compared to other popular domain impersonation attacks (typosquatting, combosquatting, and homographs). Taken together, our results show that target embedding leads to significantly more user mistakes than any other impersonation attack currently possible in modern browsers. Moreover, the results show that if a user falls for a target embedding attack once, they are likely to fall for it multiple times, more so than with other domain impersonation attacks. Summarized simply: to users, domain names are who they “appear” to be, and target embedding is currently the most effective means of appearing to be someone a domain is not.


For more information, check out our paper from CCS'19.

Datasets

Our primary dataset comprises all certificates collected by Censys up to May 18, 2019. Censys’s dataset includes a combination of active scans (they scan all IPv4 addresses and popular TLDs’ zone files) and Certificate Transparency (CT) logs. Prior work has estimated that this combined dataset captures over 99% of observed certificates. As discussed in our paper, we filter this dataset down to 435,717 unique certificates that contain 256,045 unique target-embedding domains. We provide this dataset here.

Name: Target Embedding Dataset
Type: gzipped TSV (tab-separated values)
Size: 69 MB
Format: README
SHA-256 Hash (Compressed): 97ecf90db94917cd76a366318d6aaab9ebec882fe029786b6a4a3cb3a95dc4c5
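After downloading the dataset, it is worth checking the compressed file against the published SHA-256 hash before reading it. The sketch below shows one way to do that with the Python standard library. The local file name passed to these functions is up to you, and the column layout is not assumed here; consult the README for the meaning of each field.

```python
import csv
import gzip
import hashlib

# Hash of the compressed file, as published in the table above.
EXPECTED_SHA256 = "97ecf90db94917cd76a366318d6aaab9ebec882fe029786b6a4a3cb3a95dc4c5"


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so the 69 MB archive is never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str) -> bool:
    """Compare the compressed file's hash with the published value."""
    return sha256_of_file(path) == EXPECTED_SHA256


def iter_rows(path: str):
    """Yield each tab-separated row of the gzipped file as a list of strings."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        # QUOTE_NONE: treat quote characters as ordinary data (an assumption
        # about the file; adjust if the README says fields are quoted).
        yield from csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
```

A typical use would be to call `matches_published_hash` on the downloaded file first, and only then loop over `iter_rows` to process the certificates.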

This web page will be updated in the near future to include additional datasets discussed in our paper (wildcards, coordinated campaigns, comparisons to typosquatting/combosquatting), as well as our code that detects if a domain is target-embedding. If you are in need of this data now, reach out using the contact information below.

Contact

If you have any questions, comments or concerns, or if you're interested in using our data in your research, please email Richard Roberts!