Hacker News with Generative AI: Text Processing

Stripping Emoji from a String (brettterpstra.com)
I often need to strip emoji from strings to prevent them from messing up other handling. I’ve been compiling regular expressions and I think I finally have all the bases covered.
T2x – a CLI tool for AI-first text operations (shruggingface.com)
I've started hacking on a new open source CLI tool I'm calling t2x, short for "text to whatever"
Simpler and faster parsing code with std::views::split (lemire.me)
Parsing text files is often confusing irrespective of your programming language. It can also be surprisingly slow.
Awk in 20 Minutes (2015) (ferd.ca)
Awk is a tiny programming language and a command line tool. It's particularly appropriate for log parsing on servers, mostly because Awk will operate on files, usually structured in lines of human-readable text.
Extending the context length to 1M tokens (qwenlm.github.io)
After the release of Qwen2.5, we heard the community’s demand for processing longer contexts.
S/Sed/Ed (aartaka.me)
This post starts with holding a grudge: "Posix regular expressions are extremely hard to get wrong? Uh... Have you really written any? Sounds like you might not really know either Posix or PCRE." (u/bigmell, in reply to "5 (Wrong) Regex To Parse Parentheses")
Show HN: Chonkie – A Fast, Lightweight Text Chunking Library for RAG (github.com/bhavnicksm)
🦛 CHONK your texts with Chonkie ✨ - The no-nonsense RAG chunking library
ASCII Delimited Text – Not CSV or Tab Delimited Text (wordpress.com)
Unfortunately a quick Google search on “ASCII Delimited Text” shows that IBM and Oracle failed to read the ASCII specification and both define ASCII Delimited Text as a CSV format. ASCII Delimited Text should use the record separators defined as ASCII 28-31.
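A minimal sketch of the idea, assuming fields are split on the unit separator (ASCII 31) and records on the record separator (ASCII 30). The helper names `dumps` and `loads` are hypothetical and are not taken from the linked post.

```python
# ASCII control characters reserved for delimiting data.
UNIT_SEP = "\x1f"    # ASCII 31: separates fields within a record
RECORD_SEP = "\x1e"  # ASCII 30: separates records


def dumps(rows):
    """Join rows of string fields into one ASCII-delimited string."""
    for row in rows:
        for field in row:
            # The separators cannot be escaped in this sketch, so reject them.
            if UNIT_SEP in field or RECORD_SEP in field:
                raise ValueError("field contains a separator character")
    return RECORD_SEP.join(UNIT_SEP.join(row) for row in rows)


def loads(text):
    """Split an ASCII-delimited string back into a list of rows."""
    if not text:
        return []  # empty input is treated as zero records
    return [record.split(UNIT_SEP) for record in text.split(RECORD_SEP)]


rows = [
    ["name", "note"],
    ["Ada", 'likes commas, "quotes"\nand newlines'],
]
assert loads(dumps(rows)) == rows
```

Because commas, quotes, and newlines are ordinary characters here, no quoting rules are needed. The trade-off is that the separators themselves are hard to type and invisible in most editors.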
Learn Awk in Y Minutes (learnxinyminutes.com)
AWK is a standard tool on every POSIX-compliant UNIX system. It’s like flex/lex, from the command-line, perfect for text-processing tasks and other scripting needs. It has a C-like syntax, but without mandatory semicolons (although you should use them anyway, because they are required when you’re writing one-liners, something AWK excels at), manual memory management, or static typing. You can call it from a shell script, or you can use it as a stand-alone scripting language.
Lisp Query Notation (LQN) (inconvergent.net)
For a while I have wanted to make my own terminal utility for manipulating text files. Some version of Sed, or AWK; or maybe even .jq. And I finally did. So here are the first 25 Fibonacci numbers calculated, and printed in an unnecessarily complicated way, using my new query language: Lisp Query Notation (LQN):
Bionic reading converter for ePub in Rust (github.com/mmatczuk)
Bioniconv is a single-pass bionic reading converter for epub files. It is written in Rust for single-threaded performance.
Data Version Control (dvc.org)
Extract and parse text from documents and create vector embeddings in a scalable and distributed way (and less than 70 lines of code).
Unix for Poets: Basic NLP Tasks Using Unix Tools (medium.com)
Often, we become so captivated by complexity and sophistication that we overlook the profound effectiveness of simple, fundamental methods.
Idiomatic Awk (2010) (backreference.org)
This is just one of many possible ways to do this.
Rosie pattern language: modern text pattern matching to replace regex (rosie-lang.org)
In brief: RPL is an alternative to regex, providing a better syntax, unit tests, and packages of patterns, among other benefits.
Smlr – truncate strings in a pretty way (github.com/thenatefisher)
Truncates stdin to a maximum fixed size, abbreviating the output if over the specified length. For example, make a giant git branch name more manageable for use in PS1.
Ugrep: A more powerful, ultra fast, user-friendly, compatible grep (github.com/Genivia)
NEW ugrep 6.5: a more powerful, ultra fast, user-friendly, compatible grep. Includes a TUI, Google-like Boolean search with AND/OR/NOT, fuzzy search, hexdumps, searches (nested) archives (zip, 7z, tar, pax, cpio), compressed files (gz, Z, bz2, lzma, xz, lz4, zstd, brotli), pdfs, docs, and more
Rga: Ripgrep, but also search in PDFs, E-Books, Office documents, zip, etc. (github.com/phiresky)
rga is a line-oriented search tool that allows you to look for a regex in a multitude of file types. rga wraps the awesome ripgrep and enables it to search in pdf, docx, sqlite, jpg, movie subtitles (mkv, mp4), etc.
Text makeup – a tool to decode and explore Unicode strings (text.makeup)
This site is a proof of concept. Only some aspects and some specific examples work (more information about coverage). Please send encouragement and bug reports if you’d like it to become a real thing!
Show HN: Automatic chaptering – From raw transcripts to structured documents (huggingface.co)
Bitten by Unicode (pyatl.dev)
One product of mine takes reports that come in as a table that’s been exported to PDF, which means text extraction. For dollar figures I find a prefixed dollar symbol and convert the number following it into a `float`. If there’s a hyphen in addition to the dollar symbol, it’s negative.
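The excerpt does not say what went wrong. One plausible failure, assumed here purely for illustration, is that PDF extraction yields a dash lookalike such as U+2212 (minus sign) instead of the ASCII hyphen. A rough sketch of a parser that tolerates this follows; the name `parse_dollars` and the set of dash characters are hypothetical choices, not the author's code.

```python
import re

# Characters that may stand in for a minus sign in extracted text.
# This set is an assumption for illustration, not taken from the post.
DASHES = "-\u2010\u2011\u2012\u2013\u2014\u2212"
_DASH_CLASS = "[" + re.escape(DASHES) + "]"

# An optional dash before or after the dollar symbol, then the digits.
AMOUNT_RE = re.compile(
    rf"(?P<lead>{_DASH_CLASS}?)\s*\$\s*(?P<trail>{_DASH_CLASS}?)\s*"
    r"(?P<num>\d[\d,]*(?:\.\d+)?)"
)


def parse_dollars(text):
    """Return the first dollar amount in text as a float, or None."""
    match = AMOUNT_RE.search(text)
    if match is None:
        return None
    value = float(match.group("num").replace(",", ""))
    # Any dash variant next to the dollar symbol marks a negative amount.
    negative = bool(match.group("lead") or match.group("trail"))
    return -value if negative else value


assert parse_dollars("-$1,234.50") == -1234.5
assert parse_dollars("\u2212$12.00") == -12.0  # U+2212 MINUS SIGN
assert parse_dollars("$3.25") == 3.25
```

Matching only the ASCII hyphen would silently read the second example as a positive 12.0, which is the kind of bug the title hints at.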
Asciidoctor: A fast text processor and publishing toolchain (asciidoctor.org)
Asciidoctor is a fast, open source, Ruby-based text processor for parsing AsciiDoc® into a document model and converting it to output formats such as HTML 5, DocBook 5, manual pages, PDF, EPUB 3, and other formats.
Advanced text features and PDF (blogspot.com)
Unprojecting text with ellipses (2016) (mzucker.github.io)
Show HN: Jacinda, a functional Awk (text stream processing on the command-line) (haskell.org)
Show HN: FileKitty – Combine and label text files for LLM prompt contexts (github.com/banagale)
Show HN: Purl – A Simple Tool for Text Processing (github.com/catatsuy)