6 points by ZevsVultAveHera 20 days ago | 8 comments
Stripping Emoji from a String (brettterpstra.com) I often need to strip emoji from strings to prevent them from messing up other handling. I’ve been compiling regular expressions and I think I finally have all the bases covered.
Awk in 20 Minutes (2015) (ferd.ca) Awk is a tiny programming language and a command line tool. It's particularly appropriate for log parsing on servers, mostly because Awk will operate on files, usually structured in lines of human-readable text.
116 points by cmcconomy 95 days ago | 107 comments
S/Sed/Ed (aartaka.me) This post starts with holding a grudge: Posix regular expressions are extremely hard to get wrong? Uh... Have you really written any? Sounds like you might not really know either Posix or PCRE. u/bigmell in reply to 5 (Wrong) Regex To Parse Parentheses
199 points by bhavnicksm 103 days ago | 36 comments
ASCII Delimited Text – Not CSV or Tab Delimited Text (wordpress.com) Unfortunately a quick Google search on “ASCII Delimited Text” shows that IBM and Oracle failed to read the ASCII specification and both define ASCII Delimited Text as a CSV format. ASCII Delimited Text should use the record separators defined as ASCII 28-31.
114 points by ejstronge 103 days ago | 117 comments
Learn Awk in Y Minutes (learnxinyminutes.com) AWK is a standard tool on every POSIX-compliant UNIX system. It’s like flex/lex, from the command-line, perfect for text-processing tasks and other scripting needs. It has a C-like syntax, but without mandatory semicolons (although, you should use them anyway, because they are required when you’re writing one-liners, something AWK excels at), manual memory management, or static typing. It excels at text processing. You can call it from a shell script, or you can use it as a stand-alone scripting language.
12 points by sandwichsphinx 109 days ago | 5 comments
Lisp Query Notation (LQN) (inconvergent.net) For a while I have wanted to make my own terminal utility for manipulating text files. Some version of Sed, or AWK; or maybe even .jq. And I finally did. So here are the first 25 Fibonacci numbers calculated, and printed in an unnecessarily complicated way, using my new query language: Lisp Query Notation (LQN):
10 points by michalmatczuk 116 days ago | 5 comments
Data Version Control (dvc.org) Extract and parse text from documents and create vector embeddings in a scalable and distributed way (and less than 70 lines of code). Read more.
Bitten by Unicode (pyatl.dev) One product of mine takes reports that come in as a table that’s been exported to PDF, which means text extraction. For dollar figures I find a prefixed dollar symbol and convert the number following it into a `float`. If there’s a hyphen in addition to the dollar symbol, it’s negative.
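The rule described in this excerpt (find a dollar symbol, convert the number after it to a `float`, treat a hyphen as a negative sign) could be sketched roughly as below. This is an illustration only, not the linked author's code: the helper name `parse_dollars`, the regular expression, and the handling of thousands separators are all assumptions.

```python
import re

# Hypothetical pattern (an assumption, not taken from the linked post):
# an optional ASCII hyphen before or after the dollar symbol, then digits
# with optional thousands separators and an optional decimal part.
_DOLLAR = re.compile(r"(-?)\s*\$\s*(-?)\s*([0-9][0-9,]*(?:\.[0-9]+)?)")


def parse_dollars(text):
    """Return the first dollar figure in text as a float, or None if absent."""
    match = _DOLLAR.search(text)
    if match is None:
        return None
    # A hyphen on either side of the dollar symbol marks the value as negative.
    negative = bool(match.group(1) or match.group(2))
    value = float(match.group(3).replace(",", ""))
    return -value if negative else value
```

Note that this sketch only recognises the ASCII hyphen-minus (U+002D). A lookalike character such as the Unicode minus sign (U+2212) in front of the dollar symbol would be ignored, and the figure would be read as positive.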