Posted by Austin Morris on 2019-03-04 17:50 UTC

The gross defect of the Google 'Custom Site Search' was mentioned on our search page some time ago: as a punishment trade-off for our rejecting its advertising, Google was filtering its results from Figures of Speech to return only an arbitrary and very small subset of occurrences (that is, for our local search, not on the web as a whole).

Few of our readers will be bothered about this, but our decrepit and amnesiac authors need to check from time to time to ensure that they are not just serving up the same tosh they previously doled out but had forgotten about, so unmemorable had it been. And, of course, if they are repeating themselves, then they should as a minimum not be contradicting themselves too blatantly.

Something had to be done. But what?

Well, firstly, let's not give in to the kneejerk response of implementing a full text search of the HTML pages on the Figures of Speech website.

A full text search is an appropriate tool to explore the vast, inhomogeneous, multilingual wastes of the worldwide web. Google invests a lot of skill and linguistic knowledge to make these results relevant for the seeker – or at least to seem so. The skill is required to sift out the one percent or less of meaningful content from the mountain of irrelevant garbage, and to do that in real time.

Compared with crawling across the wilderness of the worldwide web, searching a single, hermetically sealed website offers scope for optimisation of the search procedure, particularly when the website – as in the case of Figures of Speech – is built on data that is tightly structured according to linguistic and semantic principles. The reader's eyelids begin to droop: 'What on earth does that mean?'

It means that the HTML pages that are sent from our server to your browser are merely a presentation – a refraction – of the underlying offline content, which is organized very strictly into formal linguistic elements. One of these offline pages looks nothing like the HTML page that the visitor sees.

If one just searched the HTML, then a lot of computing effort would be required to parse that content into meaningful units. If, however, one searches the underlying data, this is already structured in a way that makes it much easier to parse, allowing us to deliver much more relevant search results. In this structured content we know exactly what we are looking at, whether paragraphs, headings, subheadings, tables, lists, image captions, citations or… whatever.

A further advantage of using the underlying data is that we can link not just to the page on which a search term can be found, but to the very object in which it occurs. The result is that the user is taken exactly to the relevant place within the page and does not need to waste time doing a second search of the page itself in order to find the precise occurrence of the search term.

Furthermore, for every full-text search query the user issues, the search engine has to process, in one way or another and in real time, the index of the entire source material. In contrast, the database of search results for Figures of Speech needs to be rebuilt only when new content is added – an offline process that takes a second or two. In between these relatively infrequent moments the search database can be made accessible to site visitors using simple retrieval techniques that take only milliseconds to execute.
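For the curious, the split between a slow offline build and a fast online lookup can be sketched in a few lines of Python. This is an illustration only, not the code running on the site: the sample objects, the `page#object` link format and the names `build_index` and `lookup` are all assumptions made up for the example.

```python
import re
from collections import defaultdict

# Hypothetical structured source: (page, object id, text) triples.
OBJECTS = [
    ("schubert.html", "p1", "Franz Schubert wrote songs."),
    ("schubert.html", "p2", "The songs were anonymous at first."),
    ("miro.html", "p1", "Joan Miro painted birds and songs."),
]


def build_index(objects):
    """Offline step: map each lowercased word to links to the objects containing it."""
    index = defaultdict(list)
    for page, object_id, text in objects:
        # set() ensures an object is listed once per word, however often the word occurs.
        for word in set(re.findall(r"[a-z]+", text.lower())):
            index[word].append(f"{page}#{object_id}")
    return dict(index)


def lookup(index, word):
    """Online step: one dictionary access, with no scan of the source text."""
    return index.get(word.lower(), [])


# Built once, whenever content changes; queried as often as visitors like.
INDEX = build_index(OBJECTS)
print(lookup(INDEX, "songs"))
```

Each hit points at an object within a page rather than at the page as a whole, which is what lets a result land on the exact paragraph.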

There are no off-the-shelf solutions available to do what we want, so there was nothing for it but to wake up the old codger in the basement, who started his coding life all those decades ago with Fortran and Algol on punched cards. It's amazing what can be done on a few bottles of cheap red wine. After a couple of weeks he came up with our new search engine: an offline data processor, a text extractor, a corpus generator, an online database and a user interface.

There are currently around one and a half million words on the Figures of Speech website – most of them, admittedly, extremely uninteresting. We could cope with this mass of data in a technical sense, but the uninteresting nature of most of it makes the result annoyingly trivial and a waste of time and effort.

We drastically reduce the mass by ignoring keywords shorter than three letters. So if you search for 'an' you are going to be disappointed. You will, however, be shown 2,650 words that start with 'an' and you will see how fond we are of 'anonymous', 'anodyne' and 'anomaly'. Now, that's what's called added value.

As a further sabre slash at the monstrous pile, any keyword that occurs more than 200 times is discarded: the system is not a linguistic analysis tool, after all. At this moment, for example, we say goodbye to 21,727 'and's and 63,229 'the's – as well as around 500 other such words. Some proper nouns related to our fetishes on this website survive the sabre's blade, though: there are 2,671 'schubert's, for example – we can't discard him!

Other categories of keyword are filtered out: dates and years, for example, regnal numbers and citation minutiae. In the end the one and a half million turns into (currently) 377,959 keyword entries with their contexts.
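The filtering rules described above can be sketched roughly as follows. The two thresholds come from the text; everything else is an assumption for illustration, in particular the `KEEP_ANYWAY` whitelist (one guess at how 'schubert' might survive) and the crude digit test standing in for the real date, regnal-number and citation filters.

```python
from collections import Counter

MIN_LENGTH = 3          # from the text: keywords shorter than three letters are ignored
MAX_OCCURRENCES = 200   # from the text: keywords occurring more often are discarded
KEEP_ANYWAY = {"schubert"}  # hypothetical whitelist for favoured proper nouns


def select_keywords(words):
    """Return {keyword: count} for the words that survive the filters."""
    counts = Counter(word.lower() for word in words)
    kept = {}
    for word, count in counts.items():
        if len(word) < MIN_LENGTH:
            continue  # too short, e.g. 'an'
        if any(ch.isdigit() for ch in word):
            continue  # crude stand-in for the date and year filter
        if count > MAX_OCCURRENCES and word not in KEEP_ANYWAY:
            continue  # too common to be a useful keyword, e.g. 'the'
        kept[word] = count
    return kept


sample = ["the"] * 201 + ["schubert"] * 250 + ["an", "1828", "anodyne"]
print(select_keywords(sample))
```

Only the keywords are filtered here; the surrounding text that supplies each keyword's context is left alone.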

If you choose to type in 'franz', at the time of writing this you will obtain a list of 1,158 occurrences of that fine name. The system takes the search term the user types in as the starting point for the keyword. Thus, if you type in 'fran' you will get all the 'Franz'es as well as 'Frankenstein' and much else (1,576 entries in all).
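Treating the search term as the start of a keyword is cheap if the keywords are held in sorted order. The sketch below shows one common way to do it, a binary search followed by a short scan; whether the site does it this way is an assumption, and `prefix_matches` and the tiny keyword list are invented for the example.

```python
from bisect import bisect_left


def prefix_matches(sorted_keywords, prefix):
    """Return every keyword that starts with prefix; the list must be sorted."""
    prefix = prefix.lower()
    # Binary search finds the first keyword that is not smaller than the prefix.
    start = bisect_left(sorted_keywords, prefix)
    matches = []
    for keyword in sorted_keywords[start:]:
        if not keyword.startswith(prefix):
            break  # sorted order guarantees no later keyword can match
        matches.append(keyword)
    return matches


KEYWORDS = sorted(
    ["schubert", "franz", "frankenstein", "anonymous", "anodyne", "anomaly"]
)
print(prefix_matches(KEYWORDS, "fran"))
print(prefix_matches(KEYWORDS, "an"))
```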

The search is quite brutally careless of case and diacritics. If you don't have accents on your keyboard, just type in 'schon' and you will get 'schöne'. 'napoleon' will find the old monster, whether we have written 'Napoleon' or, pretentiously, 'Napoléon'. Beware the Greeks typing in keywords: the keyword 'ἀοιδήν' is there, but currently the input filter will have none of it. This will be changed soon – probably.
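Carelessness about case and diacritics is usually achieved by folding both the stored keywords and the search term to the same plain form before comparing them. A minimal sketch using Python's standard `unicodedata` module follows; the function name `fold` is an assumption, and the site's own folding rules may well differ.

```python
import unicodedata


def fold(text):
    """Strip diacritics and case so that 'Napoléon' and 'napoleon' compare equal."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping the combining marks leaves only the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


print(fold("Napoléon"))
print(fold("schöne").startswith(fold("schon")))
```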

The search delivers a so-called KWIC table, 'KeyWord In Context', a layout much loved by the concordance and linguistic analysis community. Although the keywords are filtered, the contexts are not: dates, regnal numbers – everything is there. This allows the searcher quickly to scan the results and find the most relevant 'franz'. The entries appear in date order, with the oldest occurrences first.
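For readers who have never met a KWIC table, here is a rough sketch of how one can be assembled: each hit is shown with a slice of text on either side, and the rows are sorted by date. The sample entries, the `width` parameter and the function name `kwic_rows` are assumptions made up for the illustration.

```python
import re


def kwic_rows(entries, keyword, width=30):
    """Return (date, left context, match, right context) rows, oldest first.

    entries is a list of (ISO date string, text) pairs.
    """
    # \b anchors the match at the start of a word; the keyword may be a prefix.
    pattern = re.compile(r"\b" + re.escape(keyword), re.IGNORECASE)
    rows = []
    for date, text in sorted(entries):  # ISO dates sort chronologically as strings
        for match in pattern.finditer(text):
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            rows.append((date, left, match.group(0), right))
    return rows


ENTRIES = [
    ("2019-03-04", "We heard Franz Schubert in 1828."),
    ("2017-01-10", "Kaiser Franz Joseph I reigned long."),
]
for date, left, match, right in kwic_rows(ENTRIES, "franz", width=10):
    print(f"{date}  {left:>10}[{match}]{right}")
```

The contexts are cut from the unfiltered text, so the dates and regnal numbers that were excluded as keywords still show up around each hit.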

A click on the arrow symbol at the end of each entry will open the relevant page in a new browser window at precisely the paragraph in which the selected entry occurs. The rest of the KWIC entry is insensible to clicks, so you can select and copy at will.

The search is essentially based on single keywords: if you search for 'saint' or for 'augustine', the name of that fine thinker will pop up in both listings, from which you can then take your pick, but a search for 'saint augustine' will return nothing. In other words, 'less can mean more'. There are a few exceptions to this. Some phrases will yield results, such as 'good old days'; this phrase will also appear in a search for 'good', but not in one for 'old' or for 'days'. The quantity of these phrasal keys will gradually increase over time.

Some of our more annoying readers will be asking themselves where the type-ahead function is. The answer is that during development it became clear that it would be more trouble than it was worth. Three or four characters followed by the return key produce relevant results much faster than a type-ahead function, since, unlike Google, our search does not require full words in order to produce useful results.

So, the codger in the basement, a good job well done, can now go back to playing Pac-Man. We may even stop repeating and contradicting ourselves quite as much as before, but this is unlikely.

The picture by Joan Miró used for the menu thumbnail reflects the state of mind of someone who has just developed a search engine. Here it is, at a useful size:

FoS image, size 708x537

Joan Miró (1893-1983), L'Oiseau au plumage déployé vole vers l'arbre argenté, 'Bird with Unfolded Plumage flies to the Silver Tree', 1953. The painting was bought at auction in 2015 by a private collector for 9,154,500 GBP, making each pixel of this image worth a shade over 24 GBP. Image: Christies, London
