Markup Language Codes

Some Handy Tweaks as per IETF/IANA/W3C Standards

If you're the kind of person who looks under the hood of the webpage you're reading, you might have noticed that I've defined this website as being written in what the linguists call "Standard Scottish English". Or if you're the kind of person who hasn't a clue what I'm on about but this vaguely tickles your interest, just right-click a blank space anywhere on this page and, in the context menu that pops up, select whatever option looks closest to "View Page Source" to open a new tab showing you the page's underlying HTML code, near the top of which you'll see this line:

<html lang="en-scotland" dir="ltr">

It's obsessively detailed nonsense like this that the vast majority of coders either aren't really aware of or just can't be bothered with, and most just leave things at lang="en" ("English") or even at that most insidious form of digital cultural imperialism, lang="en-us" ("American English"). In the UK, of course, things often just default to lang="en-gb" ("British English"), which is at least a more local form of cultural imperialism to those of us writing outwith the so-called Metropolitan Centre! But if cultural difference matters to you in this ever-globalising World Wide Web of blind acquiescence to assumed (US) norms, you might want to think a bit harder about what your code says about what you say (or in this case, how you say it). And, thankfully, a way of doing just that in ever-finer detail has developed over the years, with a proliferation of sometimes surprisingly specific language codes that we can use to tightly define the cultural specificity of all sorts of digital artefacts. In formal terms, a language code can be expressed as follows:

Attribute: lang
Value Syntax: language-[extlang]-[script]-[region]-[variant]-[extension]-[privateuse]
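
To make that syntax a bit more concrete, here is a rough Python sketch that sorts the subtags of a code into the slots above, going purely by the shape of each subtag (letters, digits and length). The function name parse_lang_tag and the catch-all "rest" slot are my own inventions for illustration. The sketch only handles two- and three-letter language subtags and does not look anything up in the IANA registry, so treat it as a toy rather than a validator.

```python
import re

def parse_lang_tag(tag):
    """Split a tag into named parts, judging each subtag by its shape only."""
    subtags = tag.lower().split("-")
    n = len(subtags)
    parts = {"language": None, "extlang": [], "script": None,
             "region": None, "variant": [], "rest": []}

    # language: 2 or 3 letters (longer forms are ignored in this sketch)
    if not re.fullmatch(r"[a-z]{2,3}", subtags[0]):
        raise ValueError(f"unexpected language subtag in {tag!r}")
    parts["language"] = subtags[0]
    i = 1

    # extlang: up to three 3-letter subtags straight after the language
    while (i < n and len(parts["extlang"]) < 3
           and re.fullmatch(r"[a-z]{3}", subtags[i])):
        parts["extlang"].append(subtags[i])
        i += 1
    # script: exactly 4 letters
    if i < n and re.fullmatch(r"[a-z]{4}", subtags[i]):
        parts["script"] = subtags[i]
        i += 1
    # region: 2 letters or 3 digits
    if i < n and re.fullmatch(r"[a-z]{2}|[0-9]{3}", subtags[i]):
        parts["region"] = subtags[i]
        i += 1
    # variant: 5-8 alphanumerics, or 4 characters starting with a digit
    while i < n and re.fullmatch(r"[a-z0-9]{5,8}|[0-9][a-z0-9]{3}", subtags[i]):
        parts["variant"].append(subtags[i])
        i += 1
    # extension / private use: introduced by a single-character subtag;
    # everything from there on is kept as-is and not checked further
    if i < n and len(subtags[i]) == 1:
        parts["rest"] = subtags[i:]
        i = n
    if i != n:
        raise ValueError(f"cannot place subtag {subtags[i]!r} in {tag!r}")
    return parts

print(parse_lang_tag("en-scotland"))
print(parse_lang_tag("sgn-bfi"))
```

Fed en-scotland, it files "scotland" under variant rather than region, simply because the subtag is eight letters long. A code like en-gb-oed is rejected by this sketch, since "oed" fits none of the shapes allowed after a region; it is one of a handful of older, irregular tags that predate the current syntax.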

The main point of interest here is the language bit, which is where we define the language of the digital artefact itself, so most of what follows is about what we can populate that field with in, for example, an HTML attribute value like lang="en". But let's not get too carried away with formalities here. You can read up on all of that nonsense in the references linked at the foot of this article, if you've a mind to. What's more to the point is the languages we can now define things like HTML elements and even whole documents in, i.e. the hands-on, practical, fun bit. So here's my great big list of language codes that can be used straight off the shelf next time you decide to digitise something wonderfully obscure. And yes, they are all real, accepted definitions in The Standards themselves (cf. refs at end); they are very much not my own fevered hallucinatory inventions! In my case, I got into all of this when I decided to digitise some foosty old texts in languages like Scots, Old English, Middle Welsh and Ancient Greek and realised that lang="en" had suddenly become a humungously glaring problem. Well, turns out there's a solution, so here we go…

1: Generic language Values

ine
Indo-European languages
cel
Celtic languages
gem
Germanic languages
gmq
North Germanic languages
gmw
West Germanic languages
roa
Romance languages
sgn
Sign Languages

2: Continental Celtic

cel
Celtic (generic)
cel-gaulish
Gaulish Celtic (an older grandfathered tag, now deprecated in favour of the more specific codes below)
xcg
Cisalpine Gaulish
xtg
Transalpine Gaulish
xce
Celtiberian
xlp
Lepontic
nrc
Noric
obt
Old Breton
xbm
Middle Breton
br
Breton
xga
Galatian

The following includes languages whose Celticity is less certain:

xlg
Ligurian (ancient; not to be confused with lij, the modern Romance language of the same name)
xls
Lusitanian
xve
Venetic

3: Insular Q-Celtic

cel
Celtic (generic)
sga
Old Irish
mga
Middle Irish
ga
Irish
ghc
Hiberno-Scottish Gaelic
gd
Scottish Gaelic
gv
Manx

4: Insular P-Celtic

cel
Celtic (generic)
xcb
Cumbric
xpi
Pictish
owl
Old Welsh
wlm
Middle Welsh
cy
Welsh
oco
Old Cornish
cnx
Middle Cornish
kw
Cornish
obt
Old Breton
xbm
Middle Breton
br
Breton

5: Insular Germanic

gem
Germanic (generic)
ang
Old English
enm
Middle English
en
English
en-gb
British English
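en-gb-oed
OED English, i.e. Oxford English Dictionary spelling (an older grandfathered tag, since superseded by en-gb-oxendict)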
en-scouse
Liverpudlian English
en-ie
Irish English
en-scotland
Standard Scottish English
non
Old Norse
nrn
Norn
sco
Scots
sco-ulster
Ulster Scots
trl
Traveller Scottish

The following includes Romance languages and their Insular varieties:

roa
Romance (generic)
fro
Old French
xno
Anglo-Norman

6: Romany/Traveller

rom
Romany
rme
Angloromani
rmw
Welsh Romani
trl
Traveller Scottish

7: Sign Languages

sgn
Sign Languages (generic)
sgn-mzg
Monastic Sign Language
sgn-okl
Old Kentish Sign Language
sgn-bfi
British Sign Language
sgn-isg
Irish Sign Language
sgn-ils
International Sign Language

8: Classical/Biblical etc.

grk
Greek (generic)
grc
Ancient Greek
rge
Romano-Greek
el
Modern Greek
la
Latin
sem
Semitic (generic)
oar
Ancient Aramaic
jpa
Jewish Palestinian Aramaic
arc
Aramaic
hbo
Ancient Hebrew
he
Hebrew
cop
Coptic

9: Miscellaneous

mul
Multiple languages
crp
Creoles and pidgins
cpe
English-based creoles and pidgins
art
Artificial
und
Undetermined
mis
Uncoded
zxx
No linguistic content

Appendix 1: Some script Values

latn
Latin
latg
Latin Gaelic
latf
Latin Fraktur
ogam
Ogham
runr
Runic
zmth
Mathematical
zsym
Symbols
brai
Braille
visp
Visible Speech
sgnw
SignWriting
zxxx
Unwritten
zyyy
Undetermined
zzzz
Uncoded

Note that script values should only be defined if they are not a language's default — for example, la-latg (Latin in Gaelic script) and en-sgnw (English in SignWriting script) are correct, while la-latn (Latin in Latin script) and en-latn (English in Latin script) are spurious.
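
As a quick illustration of that rule, the following Python sketch drops a script value when it merely repeats a language's default. The DEFAULT_SCRIPT table and the function name strip_default_script are hypothetical and hold only the two languages used as examples in the note above; a fuller version would presumably draw its defaults from the IANA Language Subtag Registry.

```python
# Hypothetical lookup table: default script per language, taken only from
# the examples in the note above (Latin and English both default to latn).
DEFAULT_SCRIPT = {"en": "latn", "la": "latn"}

def strip_default_script(tag):
    """Remove the script subtag if it is just the language's default."""
    subtags = tag.lower().split("-")
    # The script, when present, is assumed to sit directly after the language.
    if len(subtags) > 1 and DEFAULT_SCRIPT.get(subtags[0]) == subtags[1]:
        del subtags[1]
    return "-".join(subtags)

for tag in ("la-latn", "la-latg", "en-latn-gb", "en-sgnw"):
    print(tag, "->", strip_default_script(tag))
```

The spurious la-latn collapses to plain la, while la-latg and en-sgnw pass through untouched because their script values actually tell the reader something.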

Appendix 2: Some region Values

150
Europe
154
Northern Europe
039
Southern Europe
155
Western Europe
151
Eastern Europe
eu
European Union
gb
Great Britain / United Kingdom (two distinct things, of course)
gg
Guernsey
je
Jersey
im
Isle of Man
gi
Gibraltar
fk
Falkland Islands
ie
Ireland
003
North America
us
United States of America
ca
Canada
029
Caribbean
jm
Jamaica
053
Australia and New Zealand
au
Australia
nz
New Zealand
in
India
pk
Pakistan
017
Middle Africa
018
Southern Africa
ug
Uganda
za
South Africa
zw
Zimbabwe
hk
Hong Kong
145
Western Asia

The above list is far from comprehensive, but focusses on values useful for Celtic Studies and for Post-Colonial Studies regarding some of the main ex-British colonies (especially those colonies that lasted long enough to develop their own major dialects of British languages). Valid examples of usage include: cel-155 (Western European Celtic), non-ie (Irish Old Norse), fro-je (Jersey Old French), en-au (Australian English), gd-ca (Canadian Scottish Gaelic), etc. Note that apparently tautologous values such as ga-ie (Irish Irish) can also be valid. Also see notes to Appendix 3: Some variant Values.
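
The region values above come in two shapes: three-digit numbers (the UN's M.49 area codes, used here for groupings like Western Europe) and two-letter codes (from ISO 3166-1, used for individual countries and territories). The Python sketch below, with the made-up helper names region_kind and add_region, tells the two apart by shape and bolts a region onto a language value. It checks shape only, not whether the code is actually assigned to anything.

```python
import re

def region_kind(region):
    """Classify a region subtag by its shape alone."""
    r = region.lower()
    if re.fullmatch(r"[0-9]{3}", r):
        return "UN M.49 area code"
    if re.fullmatch(r"[a-z]{2}", r):
        return "ISO 3166-1 two-letter code"
    raise ValueError(f"{region!r} does not look like a region subtag")

def add_region(language, region):
    """Join a language value and a region value into one tag."""
    region_kind(region)  # raises ValueError if the shape is wrong
    return f"{language.lower()}-{region.lower()}"

print(add_region("gd", "ca"), "/", region_kind("ca"))
print(add_region("cel", "155"), "/", region_kind("155"))
```

Note that add_region would reject "scotland" outright, since an eight-letter subtag can never be a region.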

Appendix 3: Some variant Values

alalc97
ALA-LC Romanisation, 1997 edition (eg: el-alalc97)
hepburn
Hepburn romanisation
heploc
Hepburn romanisation, Library of Congress method (deprecated in favour of alalc97)
wadegile
Wade-Giles romanisation
fonipa
International Phonetic Alphabet
fonxsamp
X-SAMPA transcription
kkcor
Common Cornish orthography of Revived Cornish (eg: kw-kkcor)
uccor
Unified Cornish orthography of Revived Cornish
ucrcor
Unified Cornish Revised orthography of Revived Cornish
scouse
Liverpudlian English (eg: en-scouse)
scotland
Scottish Standard English (eg: en-scotland)
ulster
Ulster Scots (eg: sco-ulster)

Note that scotland and ulster are syntactically variant-values, not region-values! So while en-scotland (Scottish Standard English) is fine, it-scotland (Scottish Italian) is not, because the registry lists the scotland variant for use with en only. Currently, the closest usable value for something like Scottish Italian would be it-gb. However, also note that while Scotland currently has no assigned region-value, neither do England, Wales or Northern Ireland: all UK constituents are grouped problematically under "gb" (cf. Appendix 2: Some region Values).
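
The point about variants being tied to particular languages can be sketched in Python as follows. The VARIANT_PREFIXES table and the function name variant_matches_prefix are hypothetical, and the table holds only pairings that appear as examples in the list above; the IANA Language Subtag Registry is where the real pairings live.

```python
# Hypothetical table: which language each variant is meant to follow,
# copied from the examples in the list above.
VARIANT_PREFIXES = {
    "scotland": {"en"},
    "scouse": {"en"},
    "ulster": {"sco"},
    "kkcor": {"kw"},
}

def variant_matches_prefix(tag):
    """Return False if a known variant follows a language it isn't listed for."""
    subtags = tag.lower().split("-")
    language, others = subtags[0], subtags[1:]
    for subtag in others:
        allowed = VARIANT_PREFIXES.get(subtag)
        # Subtags missing from the table (regions, scripts, etc.) are ignored.
        if allowed is not None and language not in allowed:
            return False
    return True

for tag in ("en-scotland", "it-scotland", "it-gb", "sco-ulster"):
    print(tag, variant_matches_prefix(tag))
```

Here en-scotland and sco-ulster pass, it-scotland fails, and it-gb passes because gb is a region and so is not in the variant table at all.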

References:

IETF Tools: Tags for Identifying Languages
IANA: Internet Assigned Numbers Authority
IANA: Protocol Registries (cf. "Language Tags")
IANA: Language Subtag Registry
W3C: Language tags in HTML and XML