Hacker News with Generative AI: Linguistics

Searching for DeepSeek's glitch tokens (outsidetext.substack.com)
“Anomalous”, “glitch”, or “unspeakable” tokens in an LLM are those that induce bizarre behavior or otherwise don’t behave like regular text.
Why is zero plural? (2024) (stackexchange.com)
For example, if we choose two 2s, zero 3s, and one 5, we get the divisor
Brits still associate working-class accents with criminals – study warns of bias (cam.ac.uk)
People who speak with accents perceived as ‘working-class’ including those from Liverpool, Newcastle, Bradford and London risk being stereotyped as more likely to have committed a crime, and becoming victims of injustice, a new study suggests.
The rise and fall of the English sentence (2017) (nautil.us)
The surprising forces influencing the complexity of the language we speak and write.
Bog Standard (2005) (bbc.co.uk)
It's pretty rare in English to find a compound word with a slang first part and a formal second part.
Did OpenAI's O1 Decipher the Indus Valley Script? (yashgoenka.com)
A few weeks ago, I had a fascinating conversation with OpenAI's O1 model about decoding the Indus Valley script - one of the world's oldest and still undeciphered writing systems.
English-friendly Romanization system proposed for Japanese language (asahi.com)
The Agency for Cultural Affairs is soliciting public comments about its plans to change romanization rules of the Japanese language for the first time in about 70 years.
2025 Banished Words List (lssu.edu)
Lake Superior State University (LSSU) proudly reveals the 2025 edition of its Banished Words List, a quirky tradition that dates back to 1976, when former LSSU Public Relations Director Bill Rabe and his colleagues delighted word enthusiasts with the first “List of Words Banished from the Queen’s English for Mis-Use, Over-Use and General Uselessness”.
Ancient Indus Valley Script Deciphered (indusscript.net)
The official Indus inscriptions repository
Ancient genomes provide final word in Indo-European linguistic origins (phys.org)
A team of 91 researchers—including famed geneticist Eske Willerslev at the Lundbeck Foundation GeoGenetics Center, University of Copenhagen—has discovered a Bronze Age genetic divergence connected to eastern and western Mediterranean Indo-European language speakers.
Interpol wants everyone to stop saying 'pig butchering' (theregister.com)
Interpol wants to put an end to the online scam known as "pig butchering" – through linguistic policing, rather than law enforcement.
Noam Chomsky at 96 (theconversation.com)
Noam Chomsky, one of the world’s most famous and respected intellectuals, will be 96 years old on Dec. 7, 2024. For more than half a century, multitudes of people have read his works in a variety of languages, and many people have relied on his commentaries and interviews for insights about intellectual debates and current events.
MIT study explains why laws are written in an incomprehensible style (news.mit.edu)
Legal documents are notoriously difficult to understand, even for lawyers. This raises the question: Why are these documents written in a style that makes them so impenetrable?
Mysterious tablet with unknown language unearthed in Georgia (archaeologymag.com)
A basalt tablet inscribed with an enigmatic language has been unearthed near Lake Bashplemi in Georgia’s Dmanisi region.
Learning Tibetan changed the way I think (2023) (lionsroar.com)
Translator Estefania Duque shares her journey studying Tibetan, revealing how language shapes the mind, influences perspective, and offers spiritual inspiration.
AI Guesses Your Accent (boldvoice.com)
Do you have an accent when speaking English? I bet I can guess your native language in less than 30 seconds.
Martha's Vineyard Sign Language (atlasobscura.com)
In 1979 in the town of Chilmark, on Martha’s Vineyard, Joan Poole Nash sat across from her great-grandmother Emily Howland Poole, surrounded by a team of linguists and a video camera. “Do you remember the signs for rain or snow?” In response her great-grandmother moved her hands, which were recorded on grainy, black-and-white tape.
Phonetic Matching (smoores.dev)
Just as a heads-up: This post starts out somewhat technical and includes a discussion of interesting algorithmic topics, like forced alignment and phonetic matching. But it ends by delving into some deeper social and human topics that might not be what everyone is looking for in a blog that’s mostly about software.
Chrestomathy (wikipedia.org)
A chrestomathy (/krɛˈstɒməθi/ kreh-STOM-ə-thee; from the Ancient Greek χρηστομάθεια khrēstomátheia 'desire of learning', from χρηστός khrēstós 'useful' + μανθάνω manthánō 'learn') is a collection of selected literary passages (usually from a single author); a selection of literary passages from a foreign language assembled for studying the language; or a text in various languages, used especially as an aid in learning a subject.
Mathematical meaning is not captured by meaning-as-use (willzeng.com)
“The meaning of a word is its use in the language.” - Wittgenstein, Philosophical Investigations
Pre-Greek Substrate (wikipedia.org)
The pre-Greek substrate (or substratum) consists of the unknown pre-Greek language or languages (either Pre-Indo-European or other Indo-European languages) spoken in prehistoric Greece prior to the emergence of the Proto-Greek language in the region c. 3200–2200 BC, during the Early Helladic period.
A brief history of the word "fuck" (lithub.com)
In all of English there are few words rich enough in their history and variety of use to warrant a dedicated dictionary that runs to hundreds of pages and multiple editions.
Gadsby: Wikip_dia's Lost Lipogram (2015) [pdf] (core.ac.uk)
Title drops in movies (titledrops.net)
A title drop is when a character in a movie says the title of the movie they're in. Here's a large-scale analysis of 73,921 movies from the last 80 years on how often, when and maybe even why that happens.
The 1600s were a watershed for swear words (2022) (historytoday.com)
Swear words are a constant, but their ability to cause offence is in flux. In the 1600s, today's obscenities were mundane.
A Misreading Of Yoda (gregorypellechi.com)
This essay will be finished.
Indo-European Words for 'Name' (starkeycomics.com)
I’ve created a huge tree to show the relationship between 64 living Indo-European languages, and many dead or extinct ones. With this template I’m planning on making a series of images to show how various words in these languages have shared etymologies. This is the first image in that series: words for “name”.
Mondegreen (wikipedia.org)
A mondegreen (/ˈmɒndɪˌɡriːn/) is a mishearing or misinterpretation of a phrase in a way that gives it a new meaning.[1] Mondegreens are most often created by a person listening to a poem or a song; the listener, being unable to hear a lyric clearly, substitutes words that sound similar and make some kind of sense.[2][3] The American writer Sylvia Wright coined the term in 1954, recalling a childhood memory of her mother reading the Scottish ballad "The Bonnie Earl
Rekhta (wikipedia.org)
Rekhta (Urdu: ریختہ [ˈreːxtaː]; Hindi: रेख़्ता [ˈreːxtaː]) was an early form of the Hindustani language.
Language is not essential for the cognitive processes that underlie thought (scientificamerican.com)
Scholars have long contemplated the connection between language and thought—and to what degree the two are intertwined—by asking whether language is somehow an essential prerequisite for thinking.