DNS-based web metrics collection mechanism

For the 30 years that the World Wide Web has existed, people have sought to understand and classify the behavior of end users. From the initial harvesting of web server logs to the invention of beaconing web bugs, industry has been trying to extract analytics and perform data analysis. The current state of the industry is a plethora of competing collectors, with each part of the enterprise selecting its own tracking mechanism. This leads to substantial overhead and even resource exhaustion. We need a mechanism that reduces the impact of collecting web analytics, both in terms of page load times and cookie proliferation.

The W3C has added the ping attribute to the <A> tag to instruct the browser to simultaneously load the HREF target and POST to the PING target, which ameliorates the impact of redirect chains. It is, however, dependent on the User Agent to implement it, and it is not a foregone conclusion that it is widely implemented.

Security-conscious practitioners know very well that the DNS protocol can be used to quietly exfiltrate data from a protected network. The curious reader may consult descriptions of the DNS Tunneling technique.

What if we were to combine these things? What if we were to leverage the DNS Tunneling technique to collect web metrics? We could either replace the current methods or augment them.

The web metric data is encoded within a purpose-built DNS query packet (i.e., the mechanism used for DNS Tunneling) for transmission to a collector instantiated as an Authoritative DNS server. We show several methods, and embodiments, to perform this type of white-hat data exfiltration, providing performance and reliability gains over existing methods.

How it would work

  • A client-side script takes all the metrics data to be sent and creates a single string out of them, adding a verification string as well as a nonce string (a sketch of the client side follows this list).
  • The data are base64-encoded to produce a string; if the data string is too long, it may first be compressed using common compression techniques such as gzip or bzip2.
  • The client requests this host from a subdomain (foo) on a domain (example.org), where the host is equal to the generated string. The fully qualified domain name would be a1b2c3d4e5f6g7.foo.example.org in this example.
  • DNS resolution of the name would carry through to the Authoritative DNS server for the example.org domain, which decodes the base64 and optionally verifies data validity (looks for a predefined string or compares the checksum). If data validation is not turned on, the data are assumed to be correct.
    • If the data are correct, it responds with “does not exist” or the IP address of a classic collector, depending on the embodiment (replacement vs. augmentation).
    • If the data are incorrect, it responds with a bogus IP address (e.g. 172.172.172.172).
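
To make these steps concrete, here is a minimal Python sketch of the client side. The zone foo.example.org and the %DNS_COLLECTOR_V1% marker come from the examples in this document; the function name and metric names are illustrative assumptions, not a prescribed interface.

    import base64
    import socket
    import uuid

    def send_metrics_via_dns(metrics, zone="foo.example.org"):
        """Encode web metrics into a DNS query name and resolve it."""
        # Serialize the metrics, then append the verification marker and a nonce.
        payload = ";".join(f"{k}={v}" for k, v in metrics.items())
        payload += ";%DNS_COLLECTOR_V1%;" + uuid.uuid4().hex[:8]

        # base64url keeps the labels hostname-safe ('+' and '/' are not legal
        # in DNS names); strip the '=' padding for the same reason.
        encoded = base64.urlsafe_b64encode(payload.encode()).decode().rstrip("=")

        # A single DNS label is limited to 63 octets, so long payloads are
        # split across multiple labels.
        labels = [encoded[i:i + 63] for i in range(0, len(encoded), 63)]
        fqdn = ".".join(labels) + "." + zone

        try:
            socket.gethostbyname(fqdn)  # augmentation: answer is a collector IP
        except socket.gaierror:
            pass                        # replacement: NXDOMAIN is the expected reply

    send_metrics_via_dns({"pageData": "foobar", "uniqID": "userID"})

Either answer arrives in a single DNS round trip, which is what frees the client to continue processing the page immediately.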

 

In some embodiments the method can behave as a pass-through proxy or as an exploding proxy. In these cases, the response to the client is immediate, and the Authoritative DNS server communicates with the Pixel Collector(s) and/or Data Processing Service(s) after the user receives the response.

  • The client would load the image response from the Pixel Collector(s), receive no IP address, or fail instantly.
  • The Authoritative DNS server, and optionally the pixel collector, would then extract the transmitted data from their logs and pass the data to a service designated to collect and process metrics. It is important to note that an extension to this embodiment may have the ADNS device post metrics to more than one collector, thus behaving as a broker for multiple collecting agencies.

Embodiments

Let me walk you through how a few embodiments of this idea might work, using the example of sending the value “uniqID=userID” for the “foobar” page to a pixel server at example.org. In other words, the web client would be loading “http://example.org/uc.gif?pageData=foobar” to transmit the “uniqID=userID” cookie value to the collector.

As a replacement capability of the existing web metrics collectors

  • DNS resolution of the name would carry through to the Authoritative DNS server for the example.org domain, which decodes the base64 and optionally verifies data validity (looks for a predefined string or verifies the checksum). If the optional data validation is turned off, the data are assumed to be correct.
    • If the data are correct, it responds with no IP address, i.e. a “does not exist” message.
    • If the data are incorrect, it responds with a bogus IP address (e.g. 172.172.172.172).
  • The client would fail in its attempt to load the resource in one of two ways: no IP address if the data validation was successful, otherwise a bogus IP address. Either way the failure is immediate, thereby freeing the client to continue processing other directives in the loaded page.
  • The Authoritative DNS server would then process the query logs to extract the transmitted data and pass the data to the analysis component of a web analytics service (a decoding sketch follows this list).
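
The server side of this embodiment reduces to recovering the payload from each logged query name. A minimal sketch, assuming the same zone and marker as the client sketch above:

    import base64

    MARKER = "%DNS_COLLECTOR_V1%"  # predefined verification string (assumed)

    def decode_query(fqdn, zone="foo.example.org"):
        """Recover the metrics from a logged query name, or None if invalid."""
        if not fqdn.endswith("." + zone):
            return None
        # Re-join the payload labels and restore the stripped base64 padding.
        encoded = fqdn[:-(len(zone) + 1)].replace(".", "")
        encoded += "=" * (-len(encoded) % 4)
        try:
            payload = base64.urlsafe_b64decode(encoded).decode()
        except (ValueError, UnicodeDecodeError):
            return None
        if MARKER not in payload:   # the optional validity check
            return None             # caller answers with the bogus address
        # Keep only key=value tokens; the marker and nonce carry no metrics.
        return dict(t.split("=", 1) for t in payload.split(";") if "=" in t)

A query for which decode_query returns None would draw the bogus 172.172.172.172 answer; anything else is extracted from the logs and handed to the analytics service.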

 

A flow chart of the above might look like this

As an augmented capability to the existing web metrics collectors

  • Combine the “pageData=foobar” and “uniqID=userID” tokens into a single string, e.g. “pageData=foobar;uniqID=userID”.
  • Add a predefined string (or checksum) to the token string, e.g. “pageData=foobar;uniqID=userID;%DNS_COLLECTOR_V1%”.
  • Use base64 encoding to produce a string, e.g. “cGFnZURhdGE9Zm9vYmFyO3VuaXFJRD11c2VySUQ7JUROU19DT0xMRUNUT1JfVjEl”. If the data string is too long, it may be compressed using common compression techniques such as gzip or bzip2.
  • The client would request http://cGFnZURhdGE9Zm9vYmFyO3VuaXFJRD11c2VySUQ7JUROU19DT0xMRUNUT1JfVjEl.example.org/. However, should the data be compressed, the client would request http://cGFnZURhdGE9Zm9vYmFyO3VuaXFJRD11c2VySUQ7JUROU19DT0xMRUNUT1JfVjEl.compressed.example.org/.
  • DNS resolution of the name would carry through to the Authoritative DNS server for the example.org domain, which decodes the base64 and optionally verifies data validity (looks for a predefined string or verifies the checksum). If the optional data validation is turned off, the data are assumed to be correct.
    • If the data are correct, it responds with the IP address of the classic collector, so that the client can still load the pixel.
    • If the data are incorrect, it responds with a bogus IP address (e.g. 172.172.172.172).
  • The client would load the uc.gif resource, thereby transmitting the metrics to the classic collector as well as to the DNS-based collector, or fail instantly when incorrect data are sent.
  • The Authoritative DNS server would then process the query logs to extract the transmitted data and pass the data to the analysis component of a web analytics service, in parallel with the classic HTTP-based collector’s metrics analysis.

A flow chart of the above might look like this

Further thoughts

Suppose that the program resident on the Authoritative DNS server which reads the query logs extracts the data and performs a POST back to a classic HTTP-based pixel collector. There would now be two POSTs for each metric unless the flows are consolidated. An additional component, however, could expand the receipt and processing of this one signal to many collectors: by configuration, the same process could notify a plurality of collectors from the single signal sent by the client. A fan-out sketch follows.
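
A minimal sketch of that fan-out component. The collector endpoints and the JSON body are illustrative assumptions; the actual wire format would be whatever each collector accepts.

    import json
    import urllib.request

    # Hypothetical endpoints; in practice these come from configuration.
    COLLECTORS = [
        "http://pixels-a.example.org/collect",
        "http://pixels-b.example.org/collect",
    ]

    def fan_out(metrics):
        """Replay one DNS-borne signal to a plurality of HTTP collectors."""
        body = json.dumps(metrics).encode()
        for url in COLLECTORS:
            req = urllib.request.Request(
                url, data=body, headers={"Content-Type": "application/json"}
            )
            try:
                urllib.request.urlopen(req, timeout=5)
            except OSError:
                pass  # one failed collector must not block the others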


Method for evaluation of natural language translation engine via canon texts

Executive summary

Disclosed is the capability to leverage canon texts in different languages to derive and evaluate the effectiveness of language translation models for application to non-canon text. This is specifically useful when generating initial language translation models for languages that are as yet unclassified, such as native/indigenous languages.

 

Background and Problem statement

The word canon comes from the Greek κανών, meaning rule or measuring stick. In fiction, canon is the material accepted as officially part of the story in an individual universe of that story. In works of fiction, canon thus provides a structure for internal consistency within the fictional universe itself.

The Bible has been translated into many languages from the biblical languages of Hebrew, Aramaic, and Greek. As of September 2016, the full Bible has been translated into 636 languages, the New Testament alone into 1,442 languages, and Bible portions or stories into 1,145 other languages. Thus at least some portion of the Bible has been translated into 3,223 languages. Translations of the Qur’an are interpretations of the scripture of Islam in languages other than Arabic. The Qur’an was originally written in Arabic and has been translated into most major African, Asian, and European languages.

“Translation studies” is an academic interdiscipline dealing with the systematic study of the theory, description and application of translation, interpreting, and localization. As an interdiscipline, Translation Studies borrows much from the various fields of study that support translation. These include comparative literature, computer science, history, linguistics, philology, philosophy, semiotics, and terminology.

There are many mechanisms for performing machine translation of languages, from rule-based approaches to the resurgent statistical approaches, which leverage word-based, phrase-based, syntax-based, hierarchical phrase-based, and other translation mechanisms. The statistical approach to machine translation is often seen as superior to the rule-based approach due to the latter’s requirement to formally develop linguistic rules, which is costly and does not generalize well. The statistical approach, by contrast, leverages existing translated corpora and generally produces more fluent translations owing to the use of a language model. It stands to reason, then, that the efficacy of a translation job is directly related to the choice of the model used. The problem is then how to choose the model.

Historically, religious canon texts are amongst the first to be considered for translation into a new language. The formal nature of canonicity provides a roadmap for language scholars to agree on an accurate translation of the written word, and these static translations thus encode relationships between any two languages. This work extracts and formalizes these relationships such that evaluation of statistical language translation models can be performed in the abstract, to ascertain the most effective, efficient, and accurate model to be applied between any two given languages.

 

Novelty

A system and method for unambiguous evaluation and classification of the effectiveness or accuracy of any arbitrary language translation model between two languages.

Advantages and value

  • The advantage of this method is that by leveraging the peer-reviewed work done by translation studies scholars in performing translations of canon text we have ready-made ground truth of both the input and output state.

 

  • The value of this method is in being able to evaluate the effectiveness of a given statistical machine translation language model against another in the absolute. To be clear, our teaching provides the ability to train and refine the translation engine for a given language pair.

Method

Given a language pair (source/target) and a plurality of candidate language translation models:

  1. Apply one initial translation model to translate the canon text in the source language, generating a candidate canon text in the target language.
  2. Perform a word-based comparison of the resultant candidate text against the canon text in the target language to generate a compatibility or faithfulness score. This score indicates, as a scalar, the efficacy of the selected language model in translating from source to target.
  3. The result is captured as a tuple of { source, target, model, score } values.
  4. Select another model from the plurality of candidate models. Repeat steps 1, 2, and 3 until all models are exhausted.
  5. Select the model with the highest score for the given source/target language pair as the model to be employed (see the sketch after this list).
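
A minimal sketch of this selection loop. The translate functions wrap the candidate models and score is any word-based similarity metric; both are assumed interfaces, not part of any particular toolkit.

    from typing import Callable, Dict, List, Tuple

    def select_best_model(
        source_canon: str,
        target_canon: str,
        models: Dict[str, Callable[[str], str]],
        score: Callable[[str, str], float],
    ) -> Tuple[str, float]:
        """Return the (model name, score) pair that best fits the canon."""
        results: List[Tuple[str, str, str, float]] = []
        for name, translate in models.items():
            candidate = translate(source_canon)            # step 1
            s = score(candidate, target_canon)             # step 2
            results.append(("source", "target", name, s))  # step 3: the tuple
        best = max(results, key=lambda r: r[3])            # steps 4 and 5
        return best[2], best[3]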

In some embodiments, an intermediary language is leveraged to translate between a source and a target text. In such cases the procedure is as follows:

  1. Apply one initial translation model to translate the canon text in the source language, generating a candidate canon text in the intermediary language.
  2. Apply one initial translation model to translate the candidate canon text in the intermediary language, generating a candidate canon text in the target language.
  3. Perform a word-based comparison of the resultant candidate text against the canon text in the target language to generate a compatibility or faithfulness score. This score indicates, as a scalar, the efficacy of the selected language models in translating from source to target.
  4. The result is captured as a tuple of { source, target, intermediary, model1, model2, score } values.
  5. Select another model from the plurality of candidate models. Repeat steps 1 through 4 until all models in all intermediary languages are exhausted.
  6. Select the combination with the highest score for the given source/target language pair as the models and intermediary to be employed (see the sketch after this list).
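
The intermediary variant enumerates every pairing of a source-to-intermediary model with an intermediary-to-target model. A sketch under the same assumed interfaces:

    from itertools import product
    from typing import Callable, Dict

    def select_best_pivot(
        source_canon: str,
        target_canon: str,
        to_pivot: Dict[str, Callable[[str], str]],    # source -> intermediary
        from_pivot: Dict[str, Callable[[str], str]],  # intermediary -> target
        score: Callable[[str, str], float],
    ):
        """Return the best-scoring {model1, model2, score} combination."""
        best = None
        for (n1, m1), (n2, m2) in product(to_pivot.items(), from_pivot.items()):
            candidate = m2(m1(source_canon))                 # steps 1 and 2
            s = score(candidate, target_canon)               # step 3
            row = {"model1": n1, "model2": n2, "score": s}   # step 4
            if best is None or s > best["score"]:            # steps 5 and 6
                best = row
        return best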

Detail


Fig 1: Selection of best model for a given language pair

 

Given a language pair (source/target) and a plurality of candidate language translation models:

  1. Apply one initial translation model to translate the canon text in the source language, generating a candidate canon text in the target language.
  2. Perform a word-based comparison of the resultant candidate text against the canon text in the target language to generate a compatibility or faithfulness score. This score indicates, as a scalar, the efficacy of the selected language model in translating from source to target. The resulting translation is compared with the canon text in the target language in the following ways:
    • If it is the same, the maximum score is assigned.
    • If it is different, individual words are extracted from the translation, compared with their translations in the canon, and ranked accordingly. For example, the words “black” and “white” are completely different, so that comparison score will be very low, but not minimal, because both words denote a color: it is a better match than a translation of “white” to “carrot”.
      Similarly, synonyms are ranked much higher, as are matches in word order, and so on. Simply put, the comparison tries to determine how similar the meaning of the translated text is to the canon in the target language (a scoring sketch follows this list).
  3. The result is captured as a tuple of { source, target, model, score } values.
  4. Select another model from the plurality of candidate models. Repeat steps 1, 2, and 3 until all models are exhausted.
  5. Select the model with the highest score for given source/target language pair as the model to be employed.
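
A minimal sketch of the graded word comparison described above. The synonym and category tables are hypothetical stand-ins for a real lexical resource such as WordNet, and the tier values are illustrative.

    # Toy lexical tables standing in for a real resource.
    SYNONYMS = {"large": {"big"}, "big": {"large"}}
    CATEGORIES = {"black": "color", "white": "color", "carrot": "vegetable"}

    def word_score(candidate, canon):
        if candidate == canon:
            return 1.0   # identical words: maximum score
        if canon in SYNONYMS.get(candidate, set()):
            return 0.8   # synonyms rank a lot higher
        cat_c, cat_k = CATEGORIES.get(candidate), CATEGORIES.get(canon)
        if cat_c is not None and cat_c == cat_k:
            return 0.2   # "black" vs "white": low but not minimal
        return 0.0       # "white" vs "carrot": unrelated

    def faithfulness(candidate_text, canon_text):
        """Average word-level score; 1.0 when the texts match exactly."""
        cand, canon = candidate_text.split(), canon_text.split()
        # Position-wise comparison; real scoring would also weigh word order.
        total = sum(word_score(c, k) for c, k in zip(cand, canon))
        return total / max(len(canon), 1)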

Barbicans for Cloud Environments

Abstract

Public cloud environments require that system administrators access the cloud hosts for system-level activities over untrusted networks. In order to maintain perimeter security, so-called jump or bastion hosts are used to reduce the attack surface. This paper discusses a mechanism to provide a very strong bastion host through the use of shared SSH keys with planned obsolescence, in combination with individual SSH keys. The result is a bastion that is secure, with minimal burden on administrators for user access maintenance.

High Level Concepts

Barbican

A barbican is a fortified outpost or gateway, such as an outer defense to a city or castle, or any tower situated over a gate or bridge, which was used for defensive purposes. Usually barbicans were situated outside the main line of defenses and connected to the city walls with a walled road called the neck. Deployment of two bastion hosts straddling a firewall can serve the function of the barbican and its neck, allowing controlled access to the protected cloud environment.


Bastions

The external bastion, being outside the firewall, is exposed to the world. The internal bastion, being inside the firewall, is only accessible by the external bastion. Each bastion host provides the door while the firewall enforcement creates the neck of the barbican.  Entities wanting to transit the barbican must authenticate against both bastion checkpoints.


Authentication Domains

Going through the trouble of creating a two-bastion barbican is all for naught if a single set of tokens allows transit through both checkpoints. Consequently, it is beneficial to require two sets of authentication tokens to successfully transit the barbican.


Just as the concentric walls of a castle have outer walls that are lower than the inner walls, so should the stringency of the authentication tokens mimic the strength of the walls. The outer authentication domain token may be shared amongst all authorized users of the cloud environment, while the inner authentication domain token should be individualized to each user. The outer authentication domain token could be further divided by role, e.g. sysadmin, netadmin, appadmin, dba, etc.

Tokens

SSH’s key-based authentication is used as the mechanism for exchanging authentication tokens with each of the bastion hosts. The tokens are configured to behave differently in each of the domains. The table below summarizes the differences.

Property of Token        Outer domain               Inner domain
-----------------        ------------               ------------
Scope                    Shared                     Individual
Lifetime                 Ephemeral                  Persistent
Generation               Automated by Service       User-generated
Server-side Enablement   Periodically by Service    Once at creation time
Client-side Enablement   Periodically by User       Once at creation time

Implementation

Service Key Generation and Distribution

Cron-based: PGP-encoded upload to an FTP or web server for distribution. Once for each role.

Service Key Installation and enablement

Cron-based: build the authorized_keys file with n-many generations and upload it to the external bastion. Once for each role. A sketch combining generation and enablement follows.
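
A minimal sketch of the cron-driven rotation cycle, assuming hypothetical paths, a retention count of three generations, and the sysadmin role; PGP encryption, distribution, and the upload to the bastion are left out.

    import subprocess
    from pathlib import Path

    KEYDIR = Path("/var/lib/barbican/keys/sysadmin")  # assumed location
    GENERATIONS = 3  # n-many generations kept valid on the bastion

    def rotate_role_key(role="sysadmin"):
        """Generate a new ephemeral key pair and rebuild authorized_keys."""
        KEYDIR.mkdir(parents=True, exist_ok=True)
        serial = len(list(KEYDIR.glob("*.pub"))) + 1
        keyfile = KEYDIR / f"{role}-{serial:04d}"
        # Empty passphrase: the key is shared and short-lived by design,
        # and would be distributed PGP-encrypted out of band.
        subprocess.run(
            ["ssh-keygen", "-t", "ed25519", "-N", "", "-f", str(keyfile),
             "-C", f"{role} ephemeral key {serial}"],
            check=True,
        )
        # Keep only the newest GENERATIONS public keys so that older
        # service keys age out of authorized_keys automatically.
        pubs = sorted(KEYDIR.glob("*.pub"))[-GENERATIONS:]
        text = "".join(p.read_text() for p in pubs)
        (KEYDIR / "authorized_keys").write_text(text)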

External Bastion

Simple host. No end-user accounts. Provide no services. Guard against escalation of privilege. Provide a command for accessing the internal bastion for convenience. Routinely clean the shared user’s home directory. Use IPTables to control inbound access to ports.

Internal Bastion

No shared accounts. Provide limited services. Guard against escalation of privileges.

Firewall

Use a firewall to enforce no sideways access to the internal bastion. This is the equivalent of the neck in the barbican.

Syslog

Without monitoring, all systems succumb. A simple syslog receiver listens in on the comings and goings of the barbican; a minimal sketch follows.
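
A minimal sketch of such a receiver in Python; a production deployment would more likely use rsyslog or syslog-ng, the log path is an assumption, and binding port 514 requires root.

    import socketserver

    class SyslogHandler(socketserver.BaseRequestHandler):
        def handle(self):
            # For UDP servers, self.request is a (data, socket) pair.
            message = self.request[0].strip().decode(errors="replace")
            # Append each bastion's message to a simple audit trail.
            with open("/var/log/barbican-audit.log", "a") as log:
                log.write(f"{self.client_address[0]} {message}\n")

    if __name__ == "__main__":
        with socketserver.UDPServer(("0.0.0.0", 514), SyslogHandler) as server:
            server.serve_forever()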


Operations

Initial setup

User provides own public key to service for installation as part of initial procurement of authorized user access. User downloads and installs the service’s ephemeral/service key onto their workstation. The ephemeral key must be renewed periodically by the user.


Day-to-day Use

The user initiates an SSH session with the external bastion using the service’s ephemeral key, then initiates a second SSH session with the internal bastion using their own persistent key through SSH agent forwarding. A configuration sketch follows.
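
A sketch of the client-side SSH configuration for the two-hop transit; the hostnames, usernames, and key paths are placeholders.

    # ~/.ssh/config on the user's workstation
    Host ext-bastion
        HostName ext-bastion.example.org         # outer door, shared account
        User svc-sysadmin
        IdentityFile ~/.ssh/sysadmin-ephemeral   # the service's ephemeral key
        ForwardAgent yes                         # persistent key stays local

    # Day-to-day transit:
    #   workstation$ ssh ext-bastion
    #   ext-bastion$ ssh alice@int-bastion.internal   # forwarded persistent key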
