Mining the data mountain

Richard Law, UTC 2020-03-01 09:17

The joy of curation

In the early days of the digital era data was generally organized a bit like the contents of a library. There was usually some sort of hierarchical tree of topics which led the seeker down through ever narrower categories until something was found.

That's how our minds worked, too, in those long-dead days, whether for cataloguing a library or planning the outline of an essay or understanding the taxonomies that were used to structure so much of our knowledge of the world.

God may have started out as a geometer, but in time the Divine Mind grew older and wiser and thus came to prefer dendritic structures. Most mathematical expressions can be expressed in tree form, by the way. Trainee programmers may start with lists and stacks, but maturity only comes when they encounter the beauty and force of tree structures.
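For readers who would like to see the claim about mathematical expressions made concrete, here is a minimal sketch in Python. The names `Leaf`, `Node`, `OPS` and `evaluate` are invented for this illustration and come from no particular library. The expression (2 + 3) * 4 is stored as a small tree and evaluated by walking it from the leaves upwards.

```python
from dataclasses import dataclass
from typing import Union
import operator


@dataclass
class Leaf:
    """A number sitting at the tip of a branch."""
    value: float


@dataclass
class Node:
    """An operator with a left and a right sub-tree."""
    op: str
    left: "Tree"
    right: "Tree"


Tree = Union[Leaf, Node]

# Hypothetical operator table for this sketch: four binary operators only.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def evaluate(tree: Tree) -> float:
    """Evaluate the tree recursively: leaves first, then their parents."""
    if isinstance(tree, Leaf):
        return tree.value
    return OPS[tree.op](evaluate(tree.left), evaluate(tree.right))


# (2 + 3) * 4 as a tree: the multiplication is the root,
# the addition is its left branch.
expr = Node("*", Node("+", Leaf(2), Leaf(3)), Leaf(4))
print(evaluate(expr))  # prints 20
```

The brackets of the written expression vanish in the tree: the grouping is carried entirely by which node hangs beneath which.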

Thus early users of DOS, Windows and most other operating systems were trained from their first encounter to store data in hierarchies of 'directories'. The physical storage of bits and bytes on a disk was not at all like that, but our minds required this metaphor in order to organize data.

The bitter first lesson for all personal computer users was to save work under sensible names in sensible places and then remember where these places were. Did you file your tax return under 'bastards', 'leeches' or perhaps 'fairy-stories' – all three were possible, but 'tax returns' was probably best (or TAXRET~1, as it was then).

Apple Mac computers were delivered without any built-in file explorer for decades. Each application decided where it would save data and users were not encouraged to trouble their pretty heads about such technical matters. There was always the desktop, if needed.

From bonsai to redwood to rain forest

Despite the immense power of the hierarchical organisation of data there are two problems with it:

Firstly, as the number of content items grew – here the usually overused word 'exponentially' is quite accurate – walking around the monster tree looking for data became a frustrating task for humans, even for the nerdiest nerd.

Secondly – and most important of all by a large margin – if a mass of content had to be made accessible, it had to be curated. In other words, someone had to enforce document labelling and content description and categorisation, which in turn meant that someone had to create and maintain this tree and assign content to the appropriate locations. Such curation was only as good as the intellectual quality of the curator. The bigger the mass, the bigger the challenge.

This degree of high-level drudgery is beyond us nowadays. We much prefer to pretend that we can just dump a pile of heterogeneous content somewhere – preferably somewhere in a 'cloud' – and let a computer sort it all out. Well, we can't.

The world wide web shambles

Computers had created this problem of information overload, and the hope was that they could also solve it. In some respects, however, they made it worse: they allowed unlimited linking to other items – linking, the alpha and omega of web technology – which turned the tree structure into what we now call a 'graph', a structure with links that can lead in all directions. Wikipedia, with its attempt at crowd curation, has driven this technique to the borders of insanity.

In comparison with the relatively orderly organization of the early message boards, the domain structure of the web itself turned into a shambles that defied all curation: the mixture of top-level, lower-level and internationalised domain names has been driving users batty ever since its introduction in 1983.

To be fair, the result was never intended to be an organization of content itself but of content providers. Recent ventures into ~~providing~~ selling ever more identifiers for particular interest groups have made an existing chaos even more chaotic.

This shouldn't really surprise us: the gigantic undertaking of the internet and the web is founded on money, not altruism. It was, ultimately, commerce that made it work. All those who shout about internet 'freedom' – something that has never, ever existed – should recall the old adage: he who pays the piper calls the tune.

It became clear early on that the traditional curation of this unimaginably huge pile of content was impossible. Archivists always like to say that although it is important to store material (which the web certainly does in grand style), it is just as important to be able to find it again. In an uncurated system the only possibility left was an automated search across all the materials in this pile. As the Google founders quickly realised, the people who own the search, own the web.

The keywords dead-end

In the early days of internet searching much emphasis was laid on including keywords in documents, but this requirement was given up in short order: the creators of internet content are completely useless at supplying useful keywords or even halfway acceptable abstracts for searching. Worse still, the shady sites in the shady corners of the web soon learned to create keywords that would game or mislead the system to their advantage.

Your author has experienced at first hand the difficulty of trying to get otherwise very bright people to supply sensible keywords for their own content. Spelling keywords consistently would have been a start, as would using a standardised dictionary of keywords. It never happened.

As for writing an abstract of a text, in the heat of the working day nearly everyone ends up just copying and pasting a paragraph or two from the body of the text. Doing this can even be counterproductive for Google searches: the search engine takes a dim view of blocks of repeated text and may even downgrade the page in its quality assessment and ranking. Creating keywords and crafting searchable descriptions require time, effort and skill; writing a good abstract arguably requires more skill than writing the original text.

Full text search

The only solution was a multilingual, full-text search through the pile of content itself. Combined with clever contextualisation and a large dollop of advanced linguistics, search engines were able to deliver mountains of hits to match the mountains of content which they had searched.
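How can a machine search the full text of a pile of documents without rereading every one of them for every query? One common textbook technique is the 'inverted index', which maps each word to the documents that contain it. The sketch below is a toy under stated assumptions: the function names `tokenize`, `build_index` and `search` and the three sample documents are invented for this illustration, and nothing here is claimed to be how any real search engine is built. It has no linguistics and no contextualisation, only word matching.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Map every word to the set of document ids in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits


# Three invented documents standing in for the pile of content.
docs = {
    "a": "Schubert wrote the Winterreise song cycle",
    "b": "Buy Schubert recordings at low prices",
    "c": "A travel diary from Vienna",
}
index = build_index(docs)

print(sorted(search(index, "schubert")))         # prints ['a', 'b']
print(sorted(search(index, "schubert prices")))  # prints ['b']
```

Even this toy shows the difficulty that follows: the index says which documents match, but nothing about which of them the user should see first.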

But then what? What is the point of delivering to the user potentially millions of hits from the entire content of the web? Who can cope with that? Clever semantic optimisation of a search might spare the user a million or so results, but a couple of million might still remain. Most users appear to find more than ten hits to be a bit of a challenge.

In effect, having ducked the problem of curating the data as a whole, we were now faced with the problem of curating the search results.

Ranking search results

Google's basic search algorithm is a work of coding genius, but its solution to the problem of the delivery of all these results was also clever: page ranking. It was the technique that made Google its fortune and gave the company its market dominance.

Page ranking meant that domains and web pages were weighted according to their 'importance'. Big, busy, 'authoritative' websites were ranked highly; after that came the websites to which they linked; then, finally, the lonely and the ignored mini-planet websites at the edge of the cultural solar system with their weedy and ineffectual 'blogrolls'.
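The idea of weighting pages by the links they receive can be sketched with the textbook form of the PageRank calculation: each page repeatedly passes a share of its own weight to the pages it links to. What follows is a minimal sketch under stated assumptions, not Google's production system. The function name `pagerank`, the damping factor of 0.85, the fixed number of iterations and the four-site toy web are all choices made for this illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by repeated redistribution of weight.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with the weight spread evenly over all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page divides its damped weight among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its weight evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# An invented toy web: two big sites linking to each other,
# and two lonely sites that link inwards but receive little or nothing.
links = {
    "bigsite": ["partner"],
    "partner": ["bigsite"],
    "blog": ["bigsite"],
    "diary": ["bigsite", "blog"],
}
ranks = pagerank(links)

for page in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{page}: {ranks[page]:.3f}")
```

In this toy web the two sites that link to each other gather nearly all the weight, while the diary, to which nobody links, is left with the baseline share only.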

Out there in that cold darkness were the personal blogs, the hobby horses, the travel diaries, the baking and the craftwork websites, the collections of images, jokes and memes. Many of these have now been scooped up into the warmth of the social media gas giants.

If the search term was very specific and matched an equally specific term on a lonely website, that website might rise up the rankings. But as soon as it had any semantic competition from a big website it would drift downwards again. You can't beat gravity – and gravity comes from mass and mass comes from aggregation.

Well, we can all get behind some ranking system, since results have to be sorted in some way for the user. Most people, after all, will want to visit the websites where most people already go.

Google's golden eggs

But then Mammon intervened, he who had always been at the centre of the development of the web. Google's immense infrastructure has to be paid for somehow and this is done by the sale of page rankings. The 'importance' of a page became another way of saying its 'commercial value'.

Many vilify Google for earning large revenues, but that money comes from providing the merchants of the world with a service that they clearly want and providing users with a free, sophisticated search tool, something they clearly need.

Without a good search engine, the web would now be a crippled, nearly useless thing. But it is important for users to realise that Google is not just a search, it is always a search with benefits – benefits for Google, that is.

If you search using Google for a non-commercial term, you might think that the results you receive will be relatively unmolested on their path to your browser. Wrong. Once past the commercial gatekeeper the term has to pass the acceptability gatekeepers, lists which will determine whether the results of your search are throttled in some way or ranked badly.

Websites on which the Google search engine unearths profanities, for example, are immediately banished to the outer darkness. Websites which still use the old HTTP protocol rather than the more modern, secure HTTPS protocol are also pushed down (how far, Google refuses to say). Whatever other criteria – political, cultural, moral etc. – Google uses to determine page rank are kept secret. It seems likely that there are very few sites which get past these gatekeepers without some intervention.

For most searches that appear at least on the surface to be non-commercial, Google has a policy of prepending a hit to that other distorting mirror, Wikipedia. Once the knowledge box has been ticked in this way, Google is then off the boon-to-man, do-no-evil, information-provider hook and now free to list URLs from its paying customers such as tourist sites, booksellers and so on. Oh, and of course that other click goldmine owned by Google, YouTube. Kerrching!

Every search is skewed

If, however, you search for a commercial product (even including galleries and museums) many – sometimes very many – of the first pages of the search will be there because someone has paid for them to be there. There will be pages and pages of results from online suppliers.

One recent Google search your author carried out shocked even him with its brutal commercialism and blatant distortion. The first five pages returned were filled with variant links to a local monopolist who had clearly paid Google well for this astonishing prominence. The manufacturer of the product came top of the list and pages of subsequent links directed to every shop in the region which stocked this product. Those first five pages were essentially occupied by a single product. Clearly, the monopolist's competitors had forgotten to tip the doorman, for the competitors' sites appeared many, many pages down – and widely scattered, at that.

On DuckDuckGo, which is much less commercialised than Google, the same search term brought up within the scope of two pages half a dozen links to the monopolist's competitors, all of them offering prices substantially below the monopolist's. Your author ultimately obtained the product at 50% of the monopolist's price.

We paranoid personalities begin to wonder whether, when sufficient silver has crossed the Google doorman's palm, links to the monopolist are not just boosted but links to competitors are also pushed down. Readers should try Google's mysterious page ranking out for themselves and always be aware of what they are dealing with when they are searching online. Remember: Google is selling your search, whatever it may be for, to the highest bidders. Every time.

Google's image search in particular is also intensely commercialised, with the photo agencies heavily over-represented in any search. They are paying Google to boost their products and Google is obliging them.

Wheat and chaff

Even when there is only a little intervention and manipulation of results, the sheer mass of the commercial offering can overload an information search. Take an example close to this website's heart: if you search with Google for 'Schubert' or some derivative or named piece of his, you will get hundreds – perhaps thousands – of links to Big Classic, the music companies who make a living from selling performances of his works and who have paid Google to promote their websites. Exactly the same products appear over and over again in the web offerings of distributors, merchants and reviewers.

The same search on DuckDuckGo will bring you a more varied set of results, though still dominated by the sheer mass of commercial offerings. Even without the obvious distortions of the Google page ranking system, the general web has become a forum for buyers and sellers – a huge community which by its sheer mass smothers non-commercial sites.

Well, every web user knows this – or ought to. The rule is: if you are searching for a commercial product, don't do it on Google unless you have to. And if you do do it on Google: caveat emptor. And, if you are searching for answers, never, ever ask Google a question.