Hacker News with Generative AI: Text Processing

Urtext: The Python plaintext library for people who've tried everything else (urtext.co)
Urtext /ˈʊrtekst/ is an open-source library for plaintext writing, research, documentation, knowledge bases, journaling, Zettelkasten, project/personal organization, note taking, a lightweight database substitute, or any other writing or information management that can be done in text format.

Python, Text Processing, Open Source, Knowledge Management, Writing

101 points by nbeversluis 71 days ago | 71 comments

Audio Effects Applied to Text (hackaday.com)
If you are a visual thinker, you might enjoy [AIHVHIA’s] recent video, which shows the effect of applying audio processing to text displayed on an oscilloscope.

Audio Effects, Visualization, Creative Projects, Programming, Text Processing

9 points by beardyw 89 days ago | 2 comments

Show HN: Textcase: A Python Library for Text Case Conversion (github.com/zobweyt)
A feature complete Python text case conversion library.

Python, Text Processing, Libraries, Open Source

71 points by zobweyt 100 days ago | 41 comments

StarVector: Generating Scalable Vector Graphics Code from Images and Text (starvector.github.io)
StarVector represents a breakthrough in Scalable Vector Graphics (SVG) generation, seamlessly integrating visual and textual inputs into a unified foundation SVG model.

Image Processing, Text Processing, AI, Code Generation

72 points by lnyan 111 days ago | 7 comments

Show HN: Jq-Like Tool for Markdown (github.com/yshavit)
mdq aims to do for Markdown what jq does for JSON: provide an easy way to zero in on specific parts of a document.

Markdown, Text Processing

325 points by yshavit 137 days ago | 75 comments

Regex is almost all you need (lookingatcomputer.substack.com)
When it comes to secret detection, regex is all you need. Almost.

Regex, Programming, Computer Science, Text Processing

13 points by addams 152 days ago | 0 comments

Ask HN: Is There Any Pattern Matching Bible or Set of Essential Readings? (ycombinator.com)
Publication analyzing/unifying topics like Dynamic Time Warping, Spell checking, Knuth-Morris-Pratt and other string search, Hidden Markov Chains, and many other text/image/signal processing algorithms/metrics in coherent form?

Pattern Matching, Algorithm, Data Science, Text Processing, Computer Science

6 points by ZevsVultAveHera 159 days ago | 8 comments

Stripping Emoji from a String (brettterpstra.com)
I often need to strip emoji from strings to prevent them from messing up other handling. I’ve been compiling regular expressions and I think I finally have all the bases covered.

Programming, Regular Expressions, Text Processing

8 points by zdw 187 days ago | 0 comments

T2x – a CLI tool for AI-first text operations (shruggingface.com)
I've started hacking on a new open source CLI tool I'm calling t2x, short for "text to whatever"

CLI Tools, Open Source, AI, Text Processing

38 points by marckohlbrugge 192 days ago | 14 comments

Simpler and faster parsing code with std::views::split (lemire.me)
Parsing text files is often confusing irrespective of your programming language. It can also be surprisingly slow.

Programming, Performance, Text Processing, C++

6 points by signa11 199 days ago | 0 comments

Awk in 20 Minutes (2015) (ferd.ca)
Awk is a tiny programming language and a command line tool. It's particularly appropriate for log parsing on servers, mostly because Awk will operate on files, usually structured in lines of human-readable text.

Programming, Text Processing

90 points by ayoisaiah 218 days ago | 21 comments

Extending the context length to 1M tokens (qwenlm.github.io)
After the release of Qwen2.5, we heard the community’s demand for processing longer contexts.

Generative AI, Context Length, Text Processing

116 points by cmcconomy 234 days ago | 107 comments

S/Sed/Ed (aartaka.me)
This post starts with holding a grudge: "Posix regular expressions are extremely hard to get wrong? Uh... Have you really written any? Sounds like you might not really know either Posix or PCRE." (u/bigmell, in reply to "5 (Wrong) Regex To Parse Parentheses")

Regular Expressions, Programming, Text Processing, Unix, Software

51 points by thunderbong 238 days ago | 29 comments

Show HN: Chonkie – A Fast, Lightweight Text Chunking Library for RAG (github.com/bhavnicksm)
🦛 CHONK your texts with Chonkie ✨ - The no-nonsense RAG chunking library

Text Processing, Libraries

199 points by bhavnicksm 242 days ago | 36 comments

ASCII Delimited Text – Not CSV or Tab Delimited Text (wordpress.com)
Unfortunately a quick google search on “ASCII Delimited Text” shows that IBM and Oracle failed to read the ASCII specification and both define ASCII Delimited Text as a CSV format. ASCII Delimited Text should use the record separators defined as ASCII 28-31.

Data Formats, Programming, Text Processing, Technical Standards

114 points by ejstronge 242 days ago | 117 comments
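
The layout described in the entry above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `encode` and `decode` are hypothetical, and it uses just two of the four separators (unit separator, ASCII 31, between fields; record separator, ASCII 30, between rows).

```python
# ASCII control characters reserved for delimiting data (decimal 28-31).
FILE_SEP, GROUP_SEP, RECORD_SEP, UNIT_SEP = "\x1c", "\x1d", "\x1e", "\x1f"


def encode(rows):
    """Join fields with the unit separator and rows with the record separator.

    Hypothetical helper for illustration; assumes no field contains
    a separator character itself.
    """
    return RECORD_SEP.join(UNIT_SEP.join(fields) for fields in rows)


def decode(text):
    """Split text back into rows of fields (inverse of encode)."""
    return [record.split(UNIT_SEP) for record in text.split(RECORD_SEP)]


# Commas, quotes, tabs, and newlines inside a field need no escaping.
rows = [["name", "note"], ["Ada", 'says "hi",\ttwice\non two lines']]
assert decode(encode(rows)) == rows
```

Because the delimiters are control characters that rarely occur in ordinary text, no quoting or escaping layer is needed, which is the main contrast with CSV.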

Learn Awk in Y Minutes (learnxinyminutes.com)
AWK is a standard tool on every POSIX-compliant UNIX system. It’s like flex/lex, from the command-line, perfect for text-processing tasks and other scripting needs. It has a C-like syntax, but without mandatory semicolons (although, you should use them anyway, because they are required when you’re writing one-liners, something AWK excels at), manual memory management, or static typing. It excels at text processing. You can call it from a shell script, or you can use it as a stand-alone scripting language.

Programming Languages, Shell Scripting, Text Processing

12 points by sandwichsphinx 248 days ago | 5 comments

Lisp Query Notation (LQN) (inconvergent.net)
For a while I have wanted to make my own terminal utility for manipulating text files. Some version of Sed, or AWK; or maybe even .jq. And I finally did. So here are the first 25 Fibonacci numbers calculated, and printed in an unnecessarily complicated way, using my new query language: Lisp Query Notation (LQN):

Programming Languages, Text Processing

135 points by surprisetalk 252 days ago | 2 comments

Bionic reading converter for ePub in Rust (github.com/mmatczuk)
Bioniconv is a single pass bionic reading converter for epub files. It is written in Rust for single threaded performance.

Rust, ePub, Text Processing, Accessibility, Performance

10 points by michalmatczuk 256 days ago | 5 comments

Data Version Control (dvc.org)
Extract and parse text from documents and create vector embeddings in a scalable and distributed way (and less than 70 lines of code).

Machine Learning, Data Science, Text Processing, Software

213 points by shcheklein 264 days ago | 52 comments

Unix for Poets: Basic NLP Tasks Using Unix Tools (medium.com)
Often, we become so captivated by complexity and sophistication that we overlook the profound effectiveness of simple, fundamental methods.

Unix, Programming, Text Processing, Data Analysis

7 points by theali 268 days ago | 0 comments

Idiomatic Awk (2010) (backreference.org)
This is just one of many possible ways to do this.

Programming, Tools, Text Processing, Unix

132 points by StefanBatory 278 days ago | 33 comments

Rosie pattern language: modern text pattern matching to replace regex (rosie-lang.org)
In brief: RPL is an alternative to regex, providing a better syntax, unit tests, and packages of patterns, among other benefits.

Programming, Regular Expressions, Software, Text Processing

18 points by fanf2 286 days ago | 6 comments

Smlr – truncate strings in a pretty way (github.com/thenatefisher)
Truncates stdin to a maximum fixed size, abbreviating the output if over the specified length. For example, make a giant git branch name more manageable for use in PS1.

Software, Text Processing, Git

9 points by fallingmeat 296 days ago | 2 comments
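
As a rough illustration of the kind of truncation the entry above describes, here is a minimal Python sketch. It is not Smlr's implementation: the function name `truncate_middle`, the `...` marker, and the choice to keep both ends of the string are all assumptions made for this example.

```python
def truncate_middle(text: str, max_len: int, marker: str = "...") -> str:
    """Return text unchanged if it fits, else keep both ends around a marker.

    Hypothetical helper; the middle-ellipsis style is an assumption.
    """
    if len(text) <= max_len:
        return text
    if max_len <= len(marker):
        # Too little room for the marker, so fall back to a hard cut.
        return text[:max_len]
    keep = max_len - len(marker)  # characters of the original we can keep
    head = (keep + 1) // 2        # the head gets the extra character
    tail = keep // 2
    return text[:head] + marker + (text[-tail:] if tail else "")


branch = "feature/JIRA-1234-very-long-branch-name"
short = truncate_middle(branch, 20)
assert len(short) == 20
```

Keeping both the start and the end is one plausible choice for branch names, where the prefix and the suffix tend to carry the most meaning.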

Ugrep: A more powerful, ultra fast, user-friendly, compatible grep (github.com/Genivia)
NEW ugrep 6.5: a more powerful, ultra fast, user-friendly, compatible grep. Includes a TUI, Google-like Boolean search with AND/OR/NOT, fuzzy search, hexdumps, searches (nested) archives (zip, 7z, tar, pax, cpio), compressed files (gz, Z, bz2, lzma, xz, lz4, zstd, brotli), pdfs, docs, and more

Software, Search, Text Processing

6 points by charlieirish 296 days ago | 1 comment

Rga: Ripgrep, but also search in PDFs, E-Books, Office documents, zip, etc. (github.com/phiresky)
rga is a line-oriented search tool that allows you to look for a regex in a multitude of file types. rga wraps the awesome ripgrep and enables it to search in pdf, docx, sqlite, jpg, movie subtitles (mkv, mp4), etc.

Search Tools, Software, Text Processing, File Formats, Open Source

516 points by bukacdan 296 days ago | 57 comments

Text makeup – a tool to decode and explore Unicode strings (text.makeup)
This site is a proof of concept. Only some aspects and some specific examples work (more information about coverage). Please send encouragement and bug reports if you’d like it to become a real thing!

Unicode, Text Processing, Tools, Programming, Web Development

98 points by microflash 298 days ago | 12 comments

Show HN: Automatic chaptering – From raw transcripts to structured documents (huggingface.co)

Machine Learning, Text Processing, Software

5 points by Yannael 304 days ago | 0 comments

Bitten by Unicode (pyatl.dev)
One product of mine takes reports that come in as a table that’s been exported to PDF, which means text extraction. For dollar figures I find a prefixed dollar symbol and convert the number following it into a `float`. If there’s a hyphen in addition to the dollar symbol, it’s negative.

Python, Programming, Text Processing, Data Extraction

130 points by pryelluw 305 days ago | 129 comments
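
The parsing rule quoted in the entry above (find a dollar symbol, convert the number after it to a `float`, treat a hyphen as negative) can be sketched as follows. The snippet does not say which character caused the problem, so this sketch assumes one plausible failure mode: extracted text that uses a Unicode dash or minus sign instead of the ASCII hyphen. The name `parse_dollars` and the set of dash characters are assumptions for illustration, not the author's code.

```python
import re

# Dash-like characters that extracted text may use as a negative sign.
# Assumed for illustration, not exhaustive: ASCII hyphen-minus, hyphen,
# non-breaking hyphen, figure dash, en dash, em dash, minus sign.
DASH_CLASS = "[-\u2010\u2011\u2012\u2013\u2014\u2212]"

# Optional dash before or after the dollar symbol, then the number.
AMOUNT = re.compile(
    rf"({DASH_CLASS})?\s*\$\s*({DASH_CLASS})?\s*([\d,]+(?:\.\d+)?)"
)


def parse_dollars(text: str) -> float:
    """Parse the first dollar figure in text; any dash marks it negative."""
    match = AMOUNT.search(text)
    if match is None:
        raise ValueError(f"no dollar amount found in {text!r}")
    sign = -1.0 if (match.group(1) or match.group(2)) else 1.0
    return sign * float(match.group(3).replace(",", ""))


assert parse_dollars("-$1,234.50") == -1234.5       # ASCII hyphen
assert parse_dollars("\u2212$12.00") == -12.0       # Unicode minus sign
```

A check for only the ASCII hyphen would silently read the second input as positive, which is the sort of error that is hard to spot in a report.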

Asciidoctor: A fast text processor and publishing toolchain (asciidoctor.org)
Asciidoctor is a fast, open source, Ruby-based text processor for parsing AsciiDoc® into a document model and converting it to output formats such as HTML 5, DocBook 5, manual pages, PDF, EPUB 3, and other formats.