In the last but one article I raised the apparent discrepancy between the number of web pages and the number of pages indexed by search engines.
Those very simple statistics showed a discrepancy of 50%. In other words, the search engines only know about half the pages that are out on the web. Unfortunately, this inconsistency is hopelessly understated.
There are experts, including Google themselves, who believe that the discrepancy is far greater, and that as little as 4% of the web is indexed. They use a much more accurate estimating method than I used in the previous article.
There are good and valid reasons why this is so. But, before going into the explanation, we should understand how search engines like Google work.
When we enter a term into their search box, the search provider does not actually search the web. What they do is examine their index of web pages and return results from there. They are more of a sorting engine than a search engine.
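To make the "sorting engine" idea concrete, here is a minimal sketch of an index lookup. The three pages, their URLs and the function names are all invented for illustration, and a real search provider's index is vastly more sophisticated. The point is only that a query is answered from the index, never by visiting the web at query time.

```python
from collections import defaultdict

# Hypothetical mini-web: URL -> page text (invented for illustration).
PAGES = {
    "example.org/cats": "kitten photos and cat care",
    "example.org/tor": "onion routing hides the source address",
    "example.org/news": "cat videos and onion recipes",
}

def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs containing every query word, consulting only the index."""
    words = query.lower().split()
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)

INDEX = build_index(PAGES)
print(search(INDEX, "onion"))     # ['example.org/news', 'example.org/tor']
print(search(INDEX, "deep web"))  # [] because no indexed page has those words
```

A page that never made it into `PAGES` can never appear in a result, however relevant it might be.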
It is the compilation of these indices that explains why so little of the web is understood by search providers.
Google, Bing and other search providers are constantly crawling the web and indexing the pages they find, guided by instructions left by the site owner. Aside from providing guidance on how to index the content, these instructions can tell the search engine not to index the site, or not to index certain pages.
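One common form these instructions take is a robots.txt file at the root of a site. The sketch below uses Python's standard `urllib.robotparser` module to show how a well-behaved crawler might consult it. The rules, the site address and the crawler name "ExampleBot" are assumptions made up for this example, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content left by a site owner.
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def may_crawl(url, agent="ExampleBot"):
    """True if the site owner's rules allow this crawler to fetch the URL."""
    return parser.can_fetch(agent, url)

print(may_crawl("https://example.org/about"))            # True
print(may_crawl("https://example.org/members/profile"))  # False
```

A crawler that honours these rules never fetches the members' pages, so they never reach the index.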
Further, if the page content is secured, and needs credentials to be viewed, it cannot be indexed as the content is not accessible.
Many web pages are dynamic, in that their content is computed as required. A good example is the Australian Bureau of Statistics site, which builds content based upon our request. These types of pages cannot be indexed, generally. Note that there are ways to index certain dynamic page content, and shopping site catalogues are an example.
As a result, we find that the index only has material that the site owner wants indexed. If it is not in the search provider’s index, it is not available as a search result.
This set of pages that can be indexed is called either the “surface web”, or the “clear web”.
The rest of the pages, the ones that are not indexed, cannot be indexed, or are secured, make up what is known as the “deep web”.
This is where our banking records, our health records, our tax records and other confidential information lives. It is where member or subscriber-only pages live. It is where the information published by corporations only for their customers lives.
It is where most social media lives. Facebook, Instagram and LinkedIn live in the deep web. We need to hold accounts – i.e. have credentials – to access the content they contain.
The deep web is massively, massively larger than the clear web – and most of it is inaccessible because it is protected.
None of the above could be considered nefarious, although there is plenty of illegal or morally dubious activity in the deep and clear webs. The really bad or really crazy people exist on the “dark net”.
The following infographic describes the clear, deep and dark parts of the web.
We access the clear and deep web using the common web tools that come on our devices, be they a phone, tablet or computer. Going into the deep web usually requires some form of credential, often just an account secured by a username and password. But we are still using common tools.
The dark net is different. We need special tools and a lot more knowledge to successfully access it.
The most used such tool is the Tor browser. Tor is an acronym for “the onion router” and it is delivered by the Tor Project. The bulk of the funding for Tor’s development has come from the federal government of the United States, initially through the Navy and DARPA.
Yes, that is right. The US Government and its military fund the tools that the dark net uses.
Tor directs traffic through a free, worldwide overlay network of relays run by volunteers. In 2017 this network had more than seven thousand relays. The aim is to hide a user’s location and usage from anyone conducting network surveillance or traffic analysis. To an external observer, Tor traffic looks like a lot of random packets containing gibberish bouncing between nodes.
Onion routing is the underpinning of Tor and it has levels of encryption nested like the layers of an onion. Tor encrypts the data, including the next node destination address, multiple times and sends it through successive, randomly selected relays. Each relay decrypts one layer, which reveals only the next relay in the circuit, and passes the remaining encrypted data on to it. The last relay decrypts the innermost layer and sends the original data to its destination without revealing or knowing the source address.
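The layering can be sketched in a few lines of Python. This is a toy under loud assumptions: the relay names and keys are invented, the "cipher" is a simple XOR keystream that is not secure, and real Tor negotiates keys with each relay and uses proper cryptography. It shows only the shape of the idea, which is that each relay can peel exactly one layer and learns nothing but the next hop.

```python
import hashlib
import json

def xor_cipher(key, data):
    """Toy symmetric cipher (NOT secure): XOR data with a SHA-256 keystream."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

# Hypothetical relays and the key the sender shares with each one.
RELAY_KEYS = {"relay-A": b"key-A", "relay-B": b"key-B", "relay-C": b"key-C"}

def wrap(message, destination, circuit):
    """Build the onion from the inside out, one encrypted layer per relay."""
    next_hop = destination
    payload = message.encode()
    for relay in reversed(circuit):
        layer = json.dumps({"next": next_hop, "data": payload.hex()}).encode()
        payload = xor_cipher(RELAY_KEYS[relay], layer)
        next_hop = relay
    return payload

def peel(relay, blob):
    """A relay removes its own layer, learning only the next hop."""
    layer = json.loads(xor_cipher(RELAY_KEYS[relay], blob))
    return layer["next"], bytes.fromhex(layer["data"])

def route(message, destination, circuit):
    """Pass the onion along the circuit until it leaves the relay network."""
    blob = wrap(message, destination, circuit)
    hop = circuit[0]
    trace = []
    while hop in RELAY_KEYS:
        trace.append(hop)
        hop, blob = peel(hop, blob)
    return trace, hop, blob.decode()

print(route("hello", "example.org", ["relay-A", "relay-B", "relay-C"]))
# (['relay-A', 'relay-B', 'relay-C'], 'example.org', 'hello')
```

Note that relay-A sees who sent the onion but not where it ends up, while relay-C sees the destination but not the sender.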
Aside from hiding users, Tor also provides anonymity to websites and other servers. Servers configured to only receive inbound connections through Tor provide what are called hidden services. Rather than revealing the server’s internet address, a hidden service is accessed through its onion address. This is very different from the DNS addresses discussed earlier.
I will discuss the dark net more in a later article, but for now I will just discuss what the dark net is used for today. Before proceeding, I want to offer some advice, and I (mis)-quote from the Disney movie, “the Lion King”.
“Don’t go there, Simba”
Mufasa, King of Pride Lands
Our common perception is that the dark net is where terrorists, criminals and spies live. We think of places like “the Silk Road” or “the Pirate Bay”. We think of places where industrial-scale piracy of software, video and music takes place. We think this is where we hire hitmen. We think this is where we get child pornography. But, it seems this is only partly true.
It turns out that a lot of what goes on in the dark net is legitimate.
The dark net is a place where whistle-blowers can communicate securely. Most news media sites use communications channels in the dark web to allow their journalists and information sources to collaborate in a way that helps protect the source. The infamous WikiLeaks site operates in this way, as do many others.
Many governments practise internet censorship, blocking content and access to certain sites. Even our own Australian government has made laughable attempts to protect its citizens from the internet. But there are repressive regimes that block their people’s ability to engage in free speech. For them, the only way to hear a world view, and to tell others what is going on, is to use the dark net. Legitimate? Probably. Legal? Most definitely not. There are regimes where such activities can lead to incarceration or execution.
There are anonymous support groups for the victims of domestic violence, rape and other situations where the victim’s anonymity is important.
A lot of intelligence gathering activities use the dark net. Spies and their spymasters collect their information and communicate in a quite secure environment. Legitimate? I guess that depends on which side we are on.
The “foil hat” brigade use the dark net. These are the people whose desire for anonymity is paranoiac and obsessive. What is worrying to me is that several places pander to this group. There is access to Facebook from the dark net. We must ask why anyone would want such protection of their identity on a social sharing platform. I would have thought that it defeats the purpose. What worries me is that it facilitates the most egregious cyber-bullying, and frankly I can see no other reason for it. This is something I will take up in another article, where I will look at the Facebook problem.
For those of us who qualify for the foil-hat brigade there are Dropbox-like file storage and sharing services. These enable us to securely store and share our kitten photos with like-minded people, with little fear of being discovered.
But, of course, it wouldn’t be kitten photos. It would be pornography, terrorist instruction manuals, stolen credit card details, hacking resources, or the plans for someone’s latest battleship.
The Electronic Frontier Foundation (EFF) believe that half of the dark net is legitimate, or at least sanctioned, and the other half is downright illegal. I will return to the EFF in a later article.
The dark net is more of a marketplace than an information repository. So, what could we buy today on the dark net?
For the equivalent in Bitcoin of AUD2.90 we could buy anyone’s Australian Medicare details, with discounts for bulk purchases. Why would we want to? Because if we were going to do a Medicare scam, like many before us, we would use these details to help convince the Australian Department of Health to transfer large amounts of money to us. It can also aid identity theft.
This is just one example of the data we could buy. The full list is eye-watering.
I should point out that this data breach is well-known to the Department of Health, and steps have been taken to address this. The example is purely to show what is out there, and not provide instructions on how to commit fraud. I don’t want to be sent to gaol.
In terms of merchandise, the list is appallingly complete. The EFF estimates that child pornography and illicit drugs make up around 20% of dark net traffic, with things like weapons accounting for a much smaller share. Pirated media is smaller still.
How big is the dark net? Once again no-one truly knows. The best estimate I have seen is from the Tor Project themselves, who claim that around 4% of the web traffic over their network is directed to the dark net. Tor is – probably – the most common way to interact with the dark net but there are at least two other popular ways to access it. I2P and Freenet are commonly used, with I2P concentrating more on the server side. Of course, many hackers will not use such packages, and will have built their own access methods.
Given the Tor browser’s minuscule market share, their claimed statistics, and the other access methods, I estimate that only 5% of the web is in the dark net. Which merely makes it huge, rather than gargantuan.
But, the very nature of the dark net prohibits analysis. It is hidden, and most sites are set up to ignore trivial queries. It is impossible to measure.
Returning to Mufasa’s warning, there are two reasons we shouldn’t go to the dark net.
The first is obvious. It is an unpleasant, unsavoury, unfriendly place, where we will be ripped off. We are innocent naïfs as far as its denizens are concerned. We will meet people who only speak Russian and offer “crime as a service”. We will meet people who believe they have a right to have sexual relations with children or animals or both. We will meet people who have container-loads of cocaine to sell – or maybe they don’t.
The second reason is less obvious. In most countries the law-enforcement, anti-terrorism and counter-espionage organisations watch the internet closely. Our use of Tor or its equivalents leaves a recognisable footprint and attracts the attention of these agencies.
A suitable metaphor is walking into a supermarket wearing a balaclava. The people there may not recognise us but they will surely notice the balaclava. The assumption they make is that we must be considering something wicked if we are wearing a balaclava.
The agencies think the same way. They can identify our use of obfuscation technology and will immediately become suspicious about why we need to use it. In their view we are associating with drug dealers, terrorists and pornographers. We are disguising ourselves. Therefore, there is probably something to investigate.
Most of us would prefer not to be the subject of an agency investigation. There are times when just being an indistinguishable part of the herd is a good place to be.
So much for the clear, deep and dark parts of the web. Remember that underneath all of that is the internet. Not only does it support the web, it also supports astronomical volumes of machine-to-machine (M2M) communication.
What are these astronomical volumes of M2M communication made up of? Part of it is related to moving transaction data around, such as banking transactions. Part of it is moving files around, as seen in software delivery. But the largest part is the inter-communication between sensors and controllers. These sensors and controllers include industrial systems, but from our perspective, the most interesting uses are in our cars and smart homes.
This is the “internet of things”, or IoT, and is a topic I will return to in a later article.
In the next article I will discuss how we use the web and how it uses us. And a good place to start is with our notion of “free”.