Hacker News with Generative AI: Data Management

InfluxDB 3 Open Source Now in Public Alpha Under MIT/Apache 2 License (influxdata.com)
Today we’re excited to announce the alpha release of InfluxDB 3 Core (download), the new open source product in the InfluxDB 3 product line along with InfluxDB 3 Enterprise (download), a commercial version that builds on Core’s foundation.
ZFS 2.3 released with ZFS raidz expansion (github.com/openzfs)
We are excited to announce the release of OpenZFS 2.3.0.
Gmvault: Backup and restore your Gmail account (ycombinator.com)
Parquet and ORC's many shortfalls for machine learning, and what to do about it? (starburst.io)
At the turn of the century (around a quarter of a century ago), over 99% of the data management industry used row-oriented storage to store data for all workloads involving structured data, including transactional and analytical workloads.
Using watermarks to coordinate change data capture in Postgres (sequinstream.com)
In change data capture, consistency is paramount. A single missing or duplicate message can cascade into time-consuming bugs and erode trust in your entire system. The moment you find a record missing in the destination, you have to wonder: is this the only one? How many others are there?
Managing Data Corruption in the Cloud (mongodb.com)
MongoDB has been named a Leader in the 2024 Gartner® Magic Quadrant™ for Cloud Database Management Systems (DBMSs) for the third consecutive year.
Tell HN: Deduplicating a 10.4 TiB game preservation archive (WIP) (ycombinator.com)
Is stuff online worth saving? (rubenerd.com)
Related to my post about self-hosted bookmarking tools (thanks to everyone for the suggestions!), I’ve been exporting my bookmarks from various sites so I can eventually aggregate them into one place. It’s a lot of work, and it might mostly be for naught.
Ask questions of SQLite databases and CSV/JSON files in your terminal (simonwillison.net)
I built a new plugin for my sqlite-utils CLI tool that lets you ask human-language questions directly of SQLite databases and CSV/JSON files on your computer.
DELETEs Are Difficult (boringsql.com)
Your database is ticking along nicely - until a simple DELETE brings it to its knees. What went wrong? While we tend to focus on optimizing SELECT and INSERT operations, we often overlook the hidden complexities of DELETE. Yet, removing unnecessary data is just as critical. Outdated or irrelevant data can bloat your database, degrade performance, and make maintenance a nightmare. Worse, retaining some types of data without valid justification might even lead to compliance issues.
NASA SC24: NASA-GPT: Searching the Entire NASA Technical Reports Server Using AI (nasa.gov)
Researchers at NASA’s Ames Research Center in Silicon Valley are using artificial intelligence (AI) tools to create a data-sharing resource for the agency’s scientific and engineering staff.
Snowflake opens chat-driven access to enterprise and third-party data (theregister.com)
Snowflake is set to preview a new platform it claims will help organizations build chatbots that can serve up data from its own analytics systems and those external to the cloud data platform vendor.
Netflix's Distributed Counter Abstraction (netflixtechblog.com)
In our previous blog post, we introduced Netflix’s TimeSeries Abstraction, a distributed service designed to store and query large volumes of temporal event data with low millisecond latencies. Today, we’re excited to present the Distributed Counter Abstraction. This counting service, built on top of the TimeSeries Abstraction, enables distributed counting at scale while maintaining similar low latency performance. As with all our abstractions, we use our Data Gateway Control Plane to shard, configure, and deploy this service globally.
Utilizing HubSpot's APIs to Break Down Data Silos (accelant.com)
I’ve seen this movie countless times: Business Has Tons of Useful Data That’s Scattered Across Systems and Isn’t Being Used for Much (or anything).
Evolving a NoSQL Database Schema (karmanivero.us)
In a NoSQL environment, Entity Manager organizes the physical distribution of data to support efficient query operations.
Get me out of data hell (mataroa.blog)
It is 9:59 AM in Melbourne, 9th October, 2024. Sunlight filters through my windows, illuminating swirling motes of dust across my living room. There is a cup of tea in my hand. I take a sip and savor it.
Make It Ephemeral: Software Should Decay and Lose Data (pocoo.org)
Most software that exists today does not forget. Creating software that remembers is easy, but designing software that deliberately “forgets” is a bit more complex. By “forgetting,” I don't mean losing data because it wasn’t saved or losing it randomly due to bugs. I'm referring to making a deliberate design decision to discard data at a later time. This ability to forget can be an incredibly beneficial property for many applications. Most importantly, software that forgets enables different user experiences.
Nulls: Revisiting null representation in modern columnar formats (dl.acm.org)
Nulls are common in real-world data sets, yet recent research on columnar formats and encodings rarely addresses Null representations.
A brief history of Notion's data catalog (notion.so)
Over the past few years, the number of data assets and systems Notion uses has skyrocketed.
A new JSON data type for ClickHouse (clickhouse.com)
JSON has become the lingua franca for handling semi-structured and unstructured data in modern data systems. Whether it’s in logging and observability scenarios, real-time data streaming, mobile app storage, or machine learning pipelines, JSON’s flexible structure makes it the go-to format for capturing and transmitting data across distributed systems.
Profile generation and data management with dbm (deepnote.com)
Ask HN: How big of a problem is unstructured data for companies? (ycombinator.com)
I read somewhere that 90% of companies have unstructured data such as documents, PDFs, videos, images, audio clips, and other content, and that this will be a big obstacle for AI.
Restic: Backups done right (restic.net)
Restic is a modern backup program that can back up your files:
Secure Custom Fields by WordPress.org (wordpress.org)
Secure Custom Fields (SCF) turns WordPress sites into a fully-fledged content management system by giving you all the tools to do more with your data.
How CERN serves 1EB of data via FUSE [video] (kernel-recipes.org)
CERN, the European Organization for Nuclear Research, generates vast amounts of data from experiments at the Large Hadron Collider (LHC).
Ask HN: What do you use to backup your VMs? (ycombinator.com)
How do you back up VMs with installed Postgres, MariaDB instances and local files?
Insights after 11 years with Datomic [video] (youtube.com)
Datomic and Content Addressable Techniques (latacora.com)
Latacora collects and analyzes data about services our clients use. You may have read about our approach to building security tooling, but the tl;dr is we make requests to all the (configuration metadata) read-only APIs available to us and store the results in S3. We leverage the data to understand our clients’ infrastructure and identify security issues and misconfigurations. We retain the files (“snapshots”) to support future IR/forensics efforts.
Rearchiving 2M hours of digital radio, a comprehensive process (digitalpreservation-blog.nb.no)
Facebook uses 10k Blu-ray discs to store 'cold' data (2014) (pcworld.com)