I first used the internet in 1987, when the software house I owned connected to the Australian Academic and Research Network (AARNET). We were granted access to this exclusive and non-commercial network because of the work we were doing in the Unix operating system space.
This was the only way to reach the internet from Australia in those days.
What it provided me was email (through a command-line, Unix-based editor), a file transfer mechanism, the ability to log in to remote systems, and access to Usenet, a facility like a bulletin board. Usenet allowed discussion of particular topics and the sharing of files via newsgroups. There was no concept of searching, other than sifting through the Usenet newsgroups for what we were after. Equally, if we wanted to publish something, we needed to put it in a place where others could find it.
Entirely non-graphical, and entirely up to us to find what we wanted.
I thought it was wonderful. But, if it remained as it was, then it was never going to become what it is today. What it needed was a means to easily access the information, and a means to find it. It needed someone to invent the World Wide Web, and that person was Tim Berners-Lee.
Berners-Lee published a paper in 1989 proposing a way to facilitate sharing and updating information among researchers. It contemplated text linked to further information, a capability he called hypertext. This hypertext linking is the foundation of the web. The paper spawned the Hypertext Markup Language (HTML), which provided a method to mark up text files so they could be displayed and interacted with. It also produced the Hypertext Transfer Protocol (HTTP) to send these pages, and the first web server to serve these files.
The first website was hosted at the European Organization for Nuclear Research (CERN), where Berners-Lee was a fellow. It went online on 6 August 1991. For Facebook’s edification (see earlier post), I think this is the 25 years they are referring to, but they still got the date wrong.
At this point, we had an agreed way to present, link and send material, and code to help programmers view and serve the content. But it all remained very much in the hands of the technologists.
The next piece of the puzzle came in 1993, when Mosaic was released. Mosaic was the first generally accepted web browser, and it ran on Unix, Apple and Microsoft systems. Certainly, there were other web browsers available at the time, but they did little to broaden public use of the internet, due mainly to usability and installation issues.
The last piece of the puzzle was the ability to find material. During 1995, both Yahoo! and AltaVista introduced web crawlers and indexers. This is software that “crawls” the web, visiting every page it can find, and indexes the results. When we use a “search engine”, it looks up our search terms in those indices and shows us the matching candidates.
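To make the crawl-and-index idea concrete, here is a minimal sketch in Python. It is an illustration only, not how any real search engine works: the page addresses are hypothetical stand-ins for what a crawler would fetch, and the “index” is simply a table mapping each word to the pages that contain it.

```python
# Toy inverted index: map each word to the set of pages containing it,
# then answer a query by intersecting those sets.

# Hypothetical, hard-coded "crawled" pages; a real crawler would fetch
# and parse millions of these over HTTP.
pages = {
    "cern.ch/hypertext": "hypertext is the foundation of the web",
    "example.com/mosaic": "mosaic was an early graphical web browser",
    "example.com/www": "the web links hypertext pages together",
}

# Build the index: word -> set of page URLs that contain it.
index = {}
for url, text in pages.items():
    for word in text.split():
        index.setdefault(word, set()).add(url)

def search(query):
    """Return the pages that contain every word of the query."""
    results = None
    for word in query.lower().split():
        matches = index.get(word, set())
        results = matches if results is None else results & matches
    return sorted(results or [])

print(search("hypertext web"))  # ['cern.ch/hypertext', 'example.com/www']
```

The hard part, of course, is doing this across billions of pages, which is why the size of the index itself becomes an interesting number, as we will see below.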
It is this combination of preparation and transmission, software to view the results, and a means to search for the material that is the web.
“I just had to take the hypertext idea and connect it to the Transmission Control Protocol and domain name system ideas and – ta-da – the World Wide Web.”
Tim Berners-Lee
“Ta-da” indeed. In “80 Moments That Shaped the World”, a list compiled by the British Council, the invention of the World Wide Web is ranked number one. Penicillin comes in at number two.
how big is the web?
In a previous article, I discussed just how big the internet is. So, given that the web is a subset of the internet, just how big is it? As of July 2016, the indexed web – a term I will return to – contained around 4.5 billion web pages, according to WorldWideWebSize.com, a site that has developed a statistical method for tracking the number of pages indexed by the major search engines. They also state that the web has doubled in size every year since 2012.
This doubling is the product of an astonishing amount of consistent use. Some of the statistics about this usage defy belief, but they are generally substantiated.
Twitter tells us they handle 350,000 tweets per minute. If we assume the average tweet is shorter than the maximum – say 100 characters – and allow for the metadata containing the when, where and who of the tweet, the result is the equivalent of around 4 million A4 pages of information being produced each day.
People upload video to YouTube at the rate of 400 hours of video every minute, according to their website. If all we did was watch YouTube videos for ten hours a day, every day, it would take us over 150 years just to view yesterday’s uploads.
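That figure is easy to check with some back-of-envelope arithmetic, using only the numbers quoted above; here is the calculation as a short Python sketch.

```python
# Back-of-envelope check of the YouTube claim, using only the figures
# quoted above: 400 hours uploaded per minute, ten hours watched per day.
hours_uploaded_per_day = 400 * 60 * 24       # 576,000 hours each day
days_to_watch = hours_uploaded_per_day / 10  # 57,600 days of viewing
print(days_to_watch / 365)                   # ~158 years
```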
The Facebook website tells us they process about 3 million posts per minute. However we look at it, that is an astounding number of embarrassing selfies and cute kitten stories.
That is just a snapshot of social media. Given that email is not going away any time soon, be aware that, on average, we send over 200 billion emails every day. Further, over one third of the world’s population has at least one email address, and that share is growing. The Radicati Group, a Palo Alto-based technology market research firm, publishes these and some other interesting statistics. If we think that is big, look at their projections for the future.
The Google website tells us they process 4 million searches each minute of every day.
If we do some arithmetic on Amazon’s sales results, we find they sell around $100,000 worth of goods per minute. All transacted exclusively through their web store.
Cloud services and repositories have become both popular and affordable. It is now common for people and organisations to consign their data, documents, images, videos and the like to purportedly secure vaults. This sector of the web is growing rapidly, with services like Dropbox, iCloud and OneDrive becoming commonplace. Just how big is this? Well, the Dropbox website tells us their users save one billion files every day. Extrapolate that – considering that Dropbox may be big, but it is far from dominant – and we quickly get into astronomical values. And that does not include corporate or government use of more specialised services, whose volumes are gargantuan by comparison with consumer usage.
In October 2014, Netcraft, another industry statistician, confirmed there were one billion websites in the world. The count rises and falls, but it seems to have stabilised at around that mark. If we assume an average of ten pages per site, that suggests there are about ten billion pages out there. Ten pages per site is certainly a massive understatement, but it will suffice for now.
If we indeed have ten billion pages, and the best estimates tell us we have indexed fewer than five billion, the question arises: why aren’t all the pages indexed? Yes, the search engines might miss some, but surely not such a discrepancy? The answer to that question is the subject of the next article. And we may not like the answer.