# Maintaining a DNS blocklist



## getopt (Feb 10, 2017)

I'm maintaining a DNS blocklist for blocking mainly ad and tracking sites. The host names come from blocklists found on the Internet.

What I did now is diff the list I kept from August 2016 (12,233 lines) against the current one (12,912 lines).

From the diff I extracted the 688 new host names that were added during the last six months, and then tried to resolve each of them with DNSSEC enabled.

Of these 688 host names, only 306 resolved, while 382 did not.
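For what it's worth, the diff step can be scripted roughly like this (the file names and host names here are made up for the demo, not my actual lists):

```shell
# Stand-ins for the two blocklist snapshots, one host name per line.
printf 'ads.example.com\ntrack.example.net\n' > old.txt
printf 'ads.example.com\nnew.tracker.example.org\ntrack.example.net\n' > new.txt

# comm -13 prints lines that appear only in the second file,
# i.e. the additions; both inputs must be sorted.
sort old.txt > old.sorted
sort new.txt > new.sorted
comm -13 old.sorted new.sorted > added.txt

cat added.txt
```

Each name in `added.txt` was then checked with a validating lookup, e.g. `drill -D somehost.tld A`, counting how many came back with an answer.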

Now, what I'm interested in is: how would you interpret these findings?

a) regarding DNSSEC enabled
b) regarding the quality of the blocklist source
c) regarding the volatility of hostnames

Finally, the practical question: should the non-resolving host names still be added to the blocklist?


----------



## obsigna (Feb 10, 2017)

getopt said:


> ...
> This list I then tried to resolve with DNSSEC enabled. Of these 688 host names, only 306 resolved, while 382 did not. Now, what I'm interested in is: how would you interpret these findings?
> 
> a) regarding DNSSEC enabled
> ...



a) I would be interested to learn whether the count of non-resolving domain names changes when you disable DNSSEC.

b) Hard to tell from the additions alone. Did you find any entries that were removed? If not, i.e. if the blocklist maintainer only ever adds domains, and more than half of the additions are already inactive, then I would tend to assume that the quality is questionable :-D

F)inally: I am running a similar system on my home server. Once a month the DNS blocklists are downloaded from various providers and compiled into a consolidated void-zones list for inclusion by the unbound.conf(5) file.

See: https://github.com/cyclaero/void-zones-tools
And: dns/void-zones-tools
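For readers who haven't seen the output: a "void zone" in this scheme is just an unbound(8) local-zone entry, so the generated file can be pulled in from unbound.conf(5) with an `include:` directive. A minimal illustration (the host names are made up, and `static` is just one common form; redirect-style zones answering with 0.0.0.0 are another):

```
# In unbound.conf:
#   server:
#       include: "/var/unbound/local-void.zones"

# /var/unbound/local-void.zones (illustrative entries):
local-zone: "ads.example.com" static
local-zone: "track.example.net" static
```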

Personally, I don't care about entries that are inactive anyway, because:

1) It may happen that a domain name is inactive but still referenced in some web content, so leaving those entries in the blocklist results in quicker response times (the local void zone answers immediately instead of waiting for an upstream lookup to fail).

2) Even with more than 35,000 void zones, unbound takes far less than a millisecond (always "Query time: 0 msec", let's say 0.3 ms) to resolve any host locally. So if I spent 30 minutes reducing the number of zones by half, and the cost of a single query thereby dropped by 0.1 ms, my effort would only pay off after 18,000,000 queries (30 minutes is 1,800,000 ms, divided by 0.1 ms saved per query). That doesn't sound like a good deal, does it?


----------



## ronaldlees (Mar 30, 2017)

obsigna said:


> ... Personally, I don't care about entries that are inactive anyway, because:
> 
> 1) It may happen that a domain name is inactive but still referenced in some web content, so leaving those entries in the blocklist results in quicker response times.
> 
> 2) Even with more than 35,000 void zones, unbound takes far less than a millisecond (always "Query time: 0 msec", let's say 0.3 ms) to resolve any host locally. So if I spent 30 minutes reducing the number of zones by half, and the cost of a single query thereby dropped by 0.1 ms, my effort would only pay off after 18,000,000 queries, which doesn't sound like a good deal, does it?



Gotta say, I agree with this. My system is much quicker when it's running with unbound, even though I'm blocking only a few of the ad domains (the heavy hitters, I think). It'd be nice to have a "snapshot" of the ad domains that could easily be pulled in via an `include:` statement in the unbound config, without having to run any scripts to download the domains. I don't need it to be that much of a real-time update; chances are I'd just download the data once in a blue moon.

One thing I've noticed intrigues me. Some sites (tabloid-like news sites in particular) cause a real blast of DNS traffic. One site caused 1,300 DNS queries in three minutes of page views. I admit I was moving around quickly, but ...

What was a little strange was that there were not that many associated HTTP or HTTPS connections (maybe a few dozen). I'm beginning to think that the DNS BY ITSELF is being used for tracking purposes. It'd be quicker, and in some cases (by name inference) would be plenty of info for the ad guys.

I suppose this is old hat/obvious to most people here.


----------



## ronaldlees (Mar 30, 2017)

I guess I'd prefer for them to use DNS, since it almost always carries less information than an HTTPS connection. The DNS method gives them the BIG ONE tho: the IP address.

Later edit:

After studying the DNS queries for a while, it appears that most of them do not relate to independent streams of downloaded (TCP) data. So they're a tracking method _in and of themselves_, at least on some tracker-heavy sites. A normal, unobtrusive site generates 15 to 25 DNS queries, and the majority of those can be associated with _real TCP traffic_. Other, tracking-heavy sites can generate thousands of queries (one page produced over a thousand, as noted above), but a very low percentage of them result in independent TCP traffic streams.

This gives these sites the ability to track users even when JavaScript is turned off. If I cache the queries with unbound, the next page-full that I get sometimes doubles the number of trackers. Yet another page-full triples it, and that is how the aforementioned page managed to pull over a thousand queries. It's _very_ relentless, and extremely dynamic, with DNS names made up on the fly. The fact that the server-side script knows to increase the DNS barrage means there is tight integration between them. All this can be done without JavaScript on the client.

So loading a configuration file filled with a "blacklist" will never completely stop it, because the method is so relentless. I determined that I have to run unbound in order to build up a big cache, and then just firewall the DNS port. Relative to the other streams (HTTP/HTTPS, etc.), the blacklist might slow down the deluge, but probably won't stop it.
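A sketch of what "firewall the DNS port" could look like with pf(4), assuming unbound runs on the gateway itself (the interface name is a placeholder, and untouched traffic is assumed to be passed by other rules):

```
# /etc/pf.conf fragment (illustrative): clients on the LAN may only
# talk DNS to the gateway's own unbound; direct external DNS is refused.
int_if = "em1"
block return in quick on $int_if proto { tcp, udp } \
    from any to ! ($int_if) port 53
```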


----------



## ronaldlees (Apr 12, 2017)

getopt said:


> From these 688 host names only 306 did resolve, while 382 did not.
> 
> Now what I'm interested in is, how would you interpret these findings?
> 
> ...



Yes, they should be added. They're for tracking _in and of themselves_, and in some cases are meant not to resolve (see my last post). But I think the blacklist method in general will fail if used only by itself, because of the dynamics. It'll slow down the deluge tho, so it is useful ...

So, the ad servers have six prongs of attack:


1 - Normally resolved addresses associated directly with their https/http servers.
2 - Unresolvable addresses that are really for DNS tracking by the DNS itself.
3 - Buried addresses that are resolved server-side, and used as in item 1.
4 - Cookies in various forms (shared or not) - less common now because people are averse to them.
5 - Persistent super-cookie client storage (because #4 is unpopular and often purged).
6 - Persistent image cache tags.

I'm seeing more of item 3 (where IP addresses are simply included in the page). I don't see how you deal with those addresses other than dynamic firewalling based on (again) the blacklist, which won't be all-inclusive. Maybe a source scrubber could be used, but that'd be browser-specific. Alternatively, a proxy/scrubber could be set up to make the process browser-transparent.

_Experiment: do a dump of an unbound cache:_

`unbound-control dump_cache > thedateandtime.cache`

Then watch your outgoing connections, look for suspicious ones, and plug those addresses into a `grep` on the cache file (or pipe `dump_cache` straight into the `grep` to do both in one line):

`grep "x.x.x.x" thedateandtime.cache`

Doing this (assuming you've firewalled the DNS port) will let you occasionally catch the item #3 stuff mentioned above (in such cases a connection was made without the benefit of DNS). It would be beneficial to automate this kind of thing. Note that to load a cache file back into `unbound`:

`unbound-control load_cache < thedateandtime.cache`

One can manually add an item to cache:

`unbound-control local_data "theservertoadd.tld A x.x.x.x"`

... or delete existing local data:

`unbound-control flush theservertoflush.tld`
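The dump-and-grep check above lends itself to a small script. A sketch, with both inputs faked so the logic is visible (in practice `connections.txt` would come from something like `netstat -n`, and the cache file from `unbound-control dump_cache` as shown earlier):

```shell
# Fake inputs for the demo: remote addresses we connected to, and a
# cache dump that only knows about one of them.
printf '203.0.113.7\n198.51.100.9\n' > connections.txt
printf 'ads.example.com. 3600 IN A 203.0.113.7\n' > thedateandtime.cache

# Any connected address that never shows up in the cache dump was
# reached without a DNS lookup -- the item #3 case from the list above.
while read -r ip; do
    grep -q "$ip" thedateandtime.cache || echo "no DNS seen for $ip"
done < connections.txt > suspicious.txt

cat suspicious.txt
```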

One other thing: the browser's DNS cache can upset the apple cart, since it seems to drop cache entries unexpectedly (only on one browser brand, I noticed). So I always turn the browser's DNS cache (browser-based resolver cache) off when I can. Due to item #6 in the list above, I usually also disable image caches. Disabling local storage for super cookies sometimes causes sites to break, so it's tough to turn off unless one is not using JavaScript (in which case the site might be broken anyway by that fact). One can flush it regularly, though. Numbers 5 and 6 are kinda similar, but are listed separately because they usually need to be disabled via separate mechanisms in browsers.

I could see certificate authorities hooking up with ad people _to use OCSP for tracking_. I have no proof of such a thing, but now that DNS is being so thoroughly abused ...

One more thing: with or without JavaScript, more and more sites are blending ad-server functionality with site functionality, so if you don't take the syrup, you don't get the puddin'. It's turning the Internet into pay-per-view television, naturally. It's the progression.

For legal purposes: I'm not an expert, this is all uneducated opinion, so please don't do what I do. Whew! Since this is not my blog, I'm done.


----------

