O corpora o mores

Selected n-grams from the Royal Navy logbooks.

One of the fun and useful tools Google have provided for us is an n-gram viewer. ‘N-gram’ is an ugly but short term meaning a phrase containing n words: so ‘rain’ is a 1-gram, ‘clear sky’ a 2-gram, and ‘overcast with squalls, hail, thunder and lightning’ a 7-gram. Google’s tool shows us how common selected phrases are, and how their use has changed over time – I’ve found intriguing results with meteorological instruments, and ship details, for example.
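To make the definition concrete, here is a minimal Python sketch that splits a phrase into its n-grams. The function name `ngrams` and the rule for what counts as a word (runs of letters, digits and apostrophes, lower-cased) are my own assumptions for illustration, not a description of how Google's tool works.

```python
import re

def ngrams(text, n):
    """Return every run of n consecutive words in text, lower-cased."""
    # Assumption: a 'word' is a run of letters, digits or apostrophes;
    # punctuation such as commas is simply dropped.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

phrase = "overcast with squalls, hail, thunder and lightning"
print(ngrams(phrase, 2))       # every 2-gram inside the phrase
print(len(ngrams(phrase, 7)))  # the phrase has 7 words, so it holds one 7-gram
```

A phrase shorter than n words simply yields no n-grams, so ‘rain’ contributes a 1-gram but no 2-grams.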

Asking how common the word ‘barometer’ is begs the question ‘how common in what?’ – English newspapers? Canadian novels? Spanish poetry? The block of text used for the search is called a corpus, and Google offers several to choose from; but we don’t have to use any of them, because we have made our own. We actually have two corpora to choose from: one from the Royal Navy WW1 logs, and a second from the US Arctic logs we are working on now. We don’t have a convenient web tool, but we can still search out the common words and phrases.

In the RN logs the most common 1-gram (word) is ‘to’, but if we disqualify ‘to’, ‘in’, ‘and’, ‘but’ and the like, the top 10 are ‘sick’, ‘ship’, ‘HMS’, ‘list’, ‘proceeded’, ‘hands’, ‘discharged’, ‘joined’, ‘arrived’ and ‘left’ – almost enough in themselves to give a sense of the naval language. Looking at popular longer phrases gives an even better picture: ‘joined ship’, ‘weighed and proceeded’, ‘hands employed cleaning ship’, ‘came to with port anchor’; all the way up to the likes of ‘Ship in dockyard hands for refit. Ship’s company employed as required on board and accommodated in sailors home’.
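Counts like these can be produced with very little code. The sketch below is an illustration only, not the script actually used on the logs: the handful of sample entries is made up from phrases quoted above, the stop-word list is a guess at ‘and the like’, and `top_ngrams` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed stop-word list; a real one would be longer.
STOP_WORDS = frozenset({"to", "in", "and", "but", "the", "a", "of", "with"})

def tokenize(entry):
    """Lower-case a log entry and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", entry.lower())

def top_ngrams(entries, n, k, stop_words=frozenset()):
    """Return the k most common n-grams across all entries, with counts.

    Stop words are only disqualified when counting single words (n == 1),
    so phrases such as 'weighed and proceeded' survive intact.
    """
    counts = Counter()
    for entry in entries:
        words = tokenize(entry)
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if n == 1 and gram[0] in stop_words:
                continue
            counts[" ".join(gram)] += 1
    return counts.most_common(k)

# Made-up sample entries, for illustration only.
sample_entries = [
    "Weighed and proceeded",
    "Hands employed cleaning ship",
    "Joined ship",
    "Hands employed cleaning ship",
    "Came to with port anchor",
]

print(top_ngrams(sample_entries, 1, 3, STOP_WORDS))
print(top_ngrams(sample_entries, 2, 3))
```

Each entry is counted separately, so a phrase never runs across the boundary between two log entries.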

The US Arctic logs are much more variable, and we have not yet accumulated a really large corpus, but we can still find the popular n-grams, and they are quite different. The top 10 words are ‘ice’, ‘drift’, ‘lead’, ‘indicated’, ‘line’, ‘being’, ‘ship’, ‘large’, ‘slight’, and ‘pack’, and we can clearly detect a very desirable obsession with the sea-ice. Longer phrases include: ‘lead line’, ‘a slight drift’, ‘indicated by the lead’, ‘inches in thickness formed over sounding hole since noon yesterday’.

