Behind the Curtain:
A Look at the Inner Workings of NSA’s XKEYSCORE

by Micah Lee, Glenn Greenwald, and Morgan Marquis-Boire

2 July 2015

@micahflee @ggreenwald @headhntr

Illustrations by Blue Delliquanti and David Axe for The Intercept

The sheer quantity of communications that XKEYSCORE processes, filters and queries is stunning. Around the world, when a person gets online to do anything - write an email, post to a social network, browse the web or play a video game - there’s a decent chance that the Internet traffic her device sends and receives is getting collected and processed by one of XKEYSCORE’s hundreds of servers scattered across the globe.

To make sense of such a massive and steady flow of information, analysts working for the National Security Agency, as well as partner spy agencies, have written thousands of snippets of code to detect different types of traffic and extract useful information from each type, according to documents dating up to 2013. For example, the system automatically detects whether a given piece of traffic is an email. If it is, the system tags whether it’s from Yahoo or Gmail, whether it contains an airline itinerary, whether it’s encrypted with PGP, or whether the sender’s language is set to Arabic, along with myriad other details.

This global Internet surveillance network is powered by a somewhat clunky piece of software running on clusters of Linux servers. Analysts access XKEYSCORE’s web interface to search its wealth of private information, similar to how ordinary people can search Google for public information.

Based on documents provided by NSA whistleblower Edward Snowden, The Intercept is shedding light on the inner workings of XKEYSCORE, one of the most extensive programs of mass surveillance in human history.

How XKEYSCORE works under the hood

It is tempting to assume that expensive, proprietary operating systems and software must power XKEYSCORE, but it actually relies on an entirely open source stack. In fact, according to an analysis of an XKEYSCORE manual for new systems administrators from the end of 2012, the system may have design deficiencies that could leave it vulnerable to attack by an intelligence agency insider.

XKEYSCORE is a piece of Linux software that is typically deployed on Red Hat servers. It uses the Apache web server and stores collected data in MySQL databases. File systems in a cluster are handled by the NFS distributed file system and the autofs service, and scheduled tasks are handled by the cron scheduling service. Systems administrators who maintain XKEYSCORE servers use SSH to connect to them, and they use tools such as rsync and vim, as well as a comprehensive command-line tool, to manage the software.

John Adams, former security lead and senior operations engineer for Twitter, says that one of the most interesting things about XKEYSCORE’s architecture is

“that they were able to achieve so much success with such a poorly designed system. Data ingest, day-to-day operations, and searching is all poorly designed. There are many open source offerings that would function far better than this design with very little work. Their operations team must be extremely unhappy.”

Analysts connect to XKEYSCORE over HTTPS using standard web browsers such as Firefox. Internet Explorer is not supported. Analysts can log into the system with either a user ID and password or by using public key authentication.

As of 2009, XKEYSCORE servers were located at more than 100 field sites all over the world. Each field site consists of a cluster of servers; the exact number differs depending on how much information is being collected at that site. Sites with relatively low traffic can get by with fewer servers, but sites that spy on larger amounts of traffic require more servers to filter and parse it all. XKEYSCORE has been engineered to scale in both processing power and storage by adding more servers to a cluster. According to a 2009 document, some field sites receive over 20 terabytes of data per day. This is the equivalent of 5.7 million songs, or over 13,000 full-length films.
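To put the 20-terabyte figure in perspective, here is a rough back-of-the-envelope calculation. It assumes decimal units (1 terabyte = 10^12 bytes) and traffic spread evenly across the day; neither assumption comes from the documents, and since the document says "over" 20 terabytes, the results are lower bounds.

```python
# Back-of-the-envelope check of the 20-terabyte-per-day figure.
# Assumption: decimal units (1 TB = 10**12 bytes).
TB = 10**12
SECONDS_PER_DAY = 24 * 60 * 60

daily_bytes = 20 * TB

# Average sizes implied by the song and film comparisons.
bytes_per_song = daily_bytes / 5_700_000
bytes_per_film = daily_bytes / 13_000

# Sustained ingest rate if the traffic arrived evenly over the day.
bytes_per_second = daily_bytes / SECONDS_PER_DAY
gigabits_per_second = bytes_per_second * 8 / 10**9

print(f"per song: {bytes_per_song / 10**6:.1f} MB")
print(f"per film: {bytes_per_film / 10**9:.2f} GB")
print(f"ingest:   {bytes_per_second / 10**6:.0f} MB/s ({gigabits_per_second:.2f} Gbit/s)")
```

Under these assumptions, the comparison implies songs of about 3.5 megabytes and films of about 1.5 gigabytes, and a single busy site would need to absorb roughly 1.85 gigabits of traffic every second, around the clock.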

This map from a 2009 top-secret presentation does not show all of XKEYSCORE’s field sites.

When data is collected at an XKEYSCORE field site, it is processed locally and ultimately stored in MySQL databases at that site. XKEYSCORE supports a federated query system, which means that an analyst can conduct a single query from the central XKEYSCORE website, and it will communicate over the Internet to all of the field sites, running the query everywhere at once.
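The general idea of a federated query can be sketched in a few lines of Python. This is only an illustration of the pattern (send one query to every site, then merge the answers), not XKEYSCORE's actual code; the site names, the records and the function names are all invented, and in-memory lists stand in for each site's database.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for per-site databases. The site names and records
# are invented for illustration and do not come from the documents.
SITES = {
    "site-a": [{"id": 1, "tag": "mail/webmail/gmail"}, {"id": 2, "tag": "chat/yahoo"}],
    "site-b": [{"id": 3, "tag": "chat/yahoo"}],
    "site-c": [{"id": 4, "tag": "http/get"}],
}

def query_site(site, tag):
    """Run the query against one site's local store and label each hit."""
    return [dict(row, site=site) for row in SITES[site] if row["tag"] == tag]

def federated_query(tag):
    """Send the same query to every site in parallel and merge the results."""
    with ThreadPoolExecutor(max_workers=len(SITES)) as pool:
        per_site = list(pool.map(lambda site: query_site(site, tag), SITES))
    return [row for rows in per_site for row in rows]

results = federated_query("chat/yahoo")
for row in results:
    print(row["site"], row["id"])
```

The design keeps the data where it was collected: only the query and the matching results travel between the central site and the field sites.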

There might be security issues with the XKEYSCORE system itself as well. As hard as software developers may try, it’s nearly impossible to write bug-free source code. To compensate for this, developers often rely on multiple layers of security; if attackers can get through one layer, they may still be thwarted by other layers. XKEYSCORE appears to do a bad job of this.

When systems administrators log into XKEYSCORE servers to configure them, they appear to use a shared account, under the name “oper.” Adams notes,

“That means that changes made by an administrator cannot be logged.”

If one administrator does something malicious on an XKEYSCORE server using the “oper” user, it’s possible that the digital trail of what was done wouldn’t lead back to the administrator, since multiple operators use the account.
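A toy example shows why a shared account weakens accountability. The log format, account names other than “oper,” and the commands below are hypothetical; the point is only that a log keyed on the account name cannot distinguish between people who share that account.

```python
# Hypothetical audit log: each entry records only the account name that was
# logged in, because that is all the server can observe.
audit_log = []

def record_action(account, command):
    """Append one administrative action to the log under the account used."""
    audit_log.append({"account": account, "command": command})

# Shared account: two different people (invented here) both appear as "oper".
record_action("oper", "restart ingest service")   # really administrator A
record_action("oper", "change scheduled task")    # really administrator B

# Per-person accounts: the same two actions stay attributable.
record_action("admin_a", "restart ingest service")
record_action("admin_b", "change scheduled task")

shared = {entry["account"] for entry in audit_log[:2]}
personal = {entry["account"] for entry in audit_log[2:]}
print(shared)         # a single name covers both people
print(len(personal))  # two distinct names, one per person
```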

There appears to be another way an ill-intentioned systems administrator may be able to cover their tracks. Analysts wishing to query XKEYSCORE sign in via a web browser, and their searches are logged. This creates an audit trail, on which the system relies to ensure that users aren’t doing overly broad searches that would pull up U.S. citizens’ web traffic. Systems administrators, however, are not limited to the web interface: the documents indicate that they can directly query the MySQL databases where the collected data is stored, apparently bypassing the audit trail.
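The gap between an audited front end and direct database access can be illustrated with a small, self-contained sketch. It uses Python's built-in SQLite module as a stand-in for MySQL, and the table names, columns and analyst name are invented; nothing here reflects XKEYSCORE's real schema.

```python
import sqlite3

# In-memory SQLite database standing in for a site's MySQL store.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE traffic (id INTEGER PRIMARY KEY, appid TEXT)")
db.execute("CREATE TABLE audit (analyst TEXT, query TEXT)")
db.executemany("INSERT INTO traffic (appid) VALUES (?)",
               [("http/get",), ("chat/yahoo",)])

def audited_search(analyst, appid):
    """Front-end path: record who searched for what, then run the search."""
    db.execute("INSERT INTO audit VALUES (?, ?)", (analyst, f"appid={appid}"))
    return db.execute("SELECT id FROM traffic WHERE appid = ?", (appid,)).fetchall()

front_end_rows = audited_search("analyst1", "chat/yahoo")

# Direct database path: the same rows come back, but no audit entry is written.
direct_rows = db.execute(
    "SELECT id FROM traffic WHERE appid = ?", ("chat/yahoo",)
).fetchall()

audit_count = db.execute("SELECT COUNT(*) FROM audit").fetchone()[0]
print(front_end_rows == direct_rows, audit_count)
```

Two searches were run, but only the one that went through the front end left a record. An audit trail enforced in the application layer protects only against users who are confined to that layer.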

AppIDs, fingerprints and microplugins

Collecting massive amounts of raw data is not very useful unless it is collated and organized in a way that can be searched. To deal with this problem, XKEYSCORE extracts and tags metadata and content from the raw data so that analysts can easily search it.

This is done by using dictionaries of rules called appIDs, fingerprints and microplugins that are written in a custom programming language called GENESIS. Each of these can be identified by a unique name that resembles a directory tree, such as “mail/webmail/gmail,” “chat/yahoo,” or “botnet/blackenergybot/command/flood.”

One document detailing XKEYSCORE appIDs and fingerprints lists several revealing examples. Windows Update requests appear to fall under the “update_service/windows” appID, and normal web requests fall under the “http/get” appID. XKEYSCORE can automatically detect Airblue travel itineraries with the “travel/airblue” fingerprint, and iPhone web browser traffic with the “browser/cellphone/iphone” fingerprint.

PGP-encrypted messages are detected with the “encryption/pgp/message” fingerprint, and messages encrypted with Mojahedeen Secrets 2 (a type of encryption popular among supporters of al Qaeda) are detected with the “encryption/mojaheden2” fingerprint.

When new traffic flows into an XKEYSCORE cluster, the system tests the intercepted data against each of these rules and stores whether the traffic matches the pattern. A slideshow presentation from 2010 says that XKEYSCORE contains almost 10,000 appIDs and fingerprints.

AppIDs are used to identify the protocol of traffic being intercepted, while fingerprints detect a specific type of content. Each intercepted stream of traffic gets assigned up to one appID and any number of fingerprints. You can think of appIDs as categories and fingerprints as tags.

If multiple appIDs match a single stream of traffic, the appID with the lowest “level” is selected (appIDs with lower levels are more specific than appIDs with higher levels). For example, when XKEYSCORE is assessing a file attachment from Yahoo mail, all of the appIDs in the following slide will apply; however, only “mail/webmail/yahoo/attachment” will be associated with this stream of traffic.
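The selection rule described above (at most one appID, chosen by lowest level, plus any number of fingerprints) can be sketched as follows. The numeric levels, the “mail/webmail/yahoo” appID and the function name are assumptions made for illustration, and the fingerprints are borrowed from other examples in the documents; the actual levels are not given in the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AppID:
    name: str
    level: float  # lower level = more specific (levels below are invented)

def classify(matching_appids, matching_fingerprints):
    """Keep at most one appID (lowest level) and every matching fingerprint."""
    best = min(matching_appids, key=lambda a: a.level, default=None)
    return (best.name if best else None, sorted(matching_fingerprints))

# AppIDs that all match the same Yahoo mail attachment stream.
matches = [
    AppID("http/get", 9.0),
    AppID("mail/webmail/yahoo", 5.0),             # hypothetical appID name
    AppID("mail/webmail/yahoo/attachment", 3.0),
]

appid, fingerprints = classify(matches, {"mail/arabic", "mail/yahoo/ymbm"})
print(appid)
print(fingerprints)
```

Because the appID acts as a single category, a tie-breaking rule is needed; fingerprints act as tags, so all of them can be kept.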

To tie it all together, when an Arabic speaker logs into a Yahoo email address, XKEYSCORE will store “mail/yahoo/login” as the associated appID. This stream of traffic will match the “mail/arabic” fingerprint (denoting language settings), as well as the “mail/yahoo/ymbm” fingerprint (which detects Yahoo browser cookies).

Sometimes the GENESIS programming language, which largely relies on Boolean logic, regular expressions and a set of simple functions, isn’t powerful enough to do the complex pattern-matching required to detect certain types of traffic. In these cases, as one slide puts it, “Power users can drop in to C++ to express themselves.” AppIDs or fingerprints that are written in C++ are called microplugins.
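To give a feel for rules built from Boolean logic and regular expressions, here is a minimal sketch in Python. It is not GENESIS syntax, which the text here does not reproduce; the patterns and the rule table are invented, and only the two fingerprint names are taken from the documents. The PGP pattern relies on the standard ASCII-armor header line used by OpenPGP messages.

```python
import re

# Invented patterns for illustration; these are not the agency's rules.
PGP_ARMOR = re.compile(r"-----BEGIN PGP MESSAGE-----")
UA_IPHONE = re.compile(r"^User-Agent:.*\biPhone\b", re.MULTILINE | re.IGNORECASE)
UA_IPAD = re.compile(r"^User-Agent:.*\biPad\b", re.MULTILINE | re.IGNORECASE)

# Each rule is a Boolean expression over regular-expression matches.
RULES = {
    "encryption/pgp/message": lambda text: bool(PGP_ARMOR.search(text)),
    "browser/cellphone/iphone": lambda text: (
        bool(UA_IPHONE.search(text)) and not UA_IPAD.search(text)
    ),
}

def matching_fingerprints(text):
    """Test the text against every rule and return the names that match."""
    return sorted(name for name, rule in RULES.items() if rule(text))

request = (
    "GET / HTTP/1.1\r\n"
    "User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X)\r\n\r\n"
)
message = "-----BEGIN PGP MESSAGE-----\n...\n-----END PGP MESSAGE-----\n"

print(matching_fingerprints(request))
print(matching_fingerprints(message))
```

Rules of this kind are easy to write and fast to evaluate, but they cannot track state across packets or decode binary protocols, which is the kind of gap that dropping into C++ is meant to fill.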

Here’s an example of a microplugin fingerprint for “botnet/conficker_p2p_udp_data,” which is tricky botnet traffic that can’t be identified without complicated logic. A botnet is a collection of hacked computers, sometimes millions of them, that are controlled from a single point.

Here’s another microplugin that uses C++ to inspect intercepted Facebook chat messages and pull out details like the associated email address and body of the chat message.

One document from 2009 describes in detail four generations of appIDs and fingerprints, which begin with only the ability to scan intercepted traffic for keywords, and end with the ability to write complex microplugins that can be deployed to field sites around the world in hours.

If XKEYSCORE development has continued at a similar pace over the last six years, it’s likely considerably more powerful today.

Illustration for The Intercept by Blue Delliquanti

Documents published with this article:

Document                                                   Date        Pages  Size
Advanced HTTP Activity Analysis                            2009        114    86 MB
Analyzing Mobile Cellular DNI in XKS                       2009-05     17     8.3 MB
ASFD Readme                                                no date     2      256 KB
CADENCE Readme                                             no date     4      456 KB
Category Throttling                                        2007-01-08  2      240 KB
CNE Analysis in XKS                                        2009-10-15  33     18 MB
Comms Readme                                               2007-01-08  8      1012 KB
DEEPDIVE Readme                                            no date     6      696 KB
DNI101                                                     no date     2      704 KB
Email Address vs User Activity                             2009-06-24  20     15 MB
Free File Uploaders                                        2006-08-13  33     23 MB
Finding and Querying Document Metadata                     2009-04     24     18 MB
Full Log vs HTTP                                           2009-06-11  33     25 MB
Guide to Using Contexts in XKS Fingerprints                2007-01-08  24     2.1 MB
HTTP Activity in XKS                                       2009-03     21     15 MB
HTTP Activity vs User Activity                             2009-06-19  36     26 MB
Intro to Context Sensitive Scanning With XKS Fingerprints  2010-05     61     9.3 MB
Intro to XKS AppIDs and Fingerprints                       2009-08-27  60     44 MB
OSINT Fusion Project                                       2009-05-18  24     14 MB
Phone Number Extractor                                     2009-10     20     14 MB
RWC Updater Readme                                         2007-01-08  13     1.7 MB
Selection Forwarding Readme                                no date     2      256 KB
Stats Config Readme                                        2007-01-08  4      408 KB
Tracking Targets on Online Social Networks                 2009-09     21     27 MB
TRAFFICTHIEF Readme                                        no date     5      632 KB
Unofficial XKS User Guide                                  2007-01-08  27     3.4 MB
User Agents                                                2010-07     42     63 MB
Using XKS to Enable TAO                                    2009-07-16  47     34 MB
UTT Config Readme                                          no date     5      564 KB
VOIP in XKS                                                2009-03     13     9.8 MB
VOIP Readme                                                no date     3      396 KB
Web Forum Exploitation Using XKS                           2009-07     7      948 KB
Writing XKS Fingerprints                                   2010-11     67     40 MB
XKS Application IDs                                        no date     42     36 MB
XKS Application IDs Brief                                  no date     21     18 MB
XKS as a SIGDEV Tool                                       no date     44     33 MB
XKS, Cipher Detection, and You!                            2008-08-21  17     1.2 MB
XKS for Counter CNE                                        2011-03     21     19 MB
XKS Intro                                                  2008-02-25  30     29 MB
XKS Logos Embedded in Docs                                 2007-01-08  2      248 KB
XKS Search Forms                                           2009-03     49     44 MB
XKS System Administration                                  2012-12     144    123 MB
XKS Targets Visiting Specific Websites                     2007-01-08  3      244 KB
XKS Tech Extractor 2009                                    2009-12     14     12 MB
XKS Tech Extractor 2010                                    2010-04     23     3.7 MB
XKS Workflows 2009                                         2009-03-05  24     19 MB
XKS Workflows 2011                                         2011-09-19  26     20 MB
UN Secretary General XKS                                   no date     1      336 KB