Hacker News with Generative AI: Interpretability

An Intuitive Explanation of Sparse Autoencoders for LLM Interpretability (adamkarvonen.github.io)
Sparse Autoencoders (SAEs) have recently become popular for interpretability of machine learning models (although sparse dictionary learning has been around since 1997). Machine learning models and LLMs are becoming more powerful and useful, but they are still black boxes: we don't understand how they do the things they are capable of, and it would clearly be useful if we could understand how they work.
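In this setting an SAE is a one-hidden-layer autoencoder trained to reconstruct a model's internal activations while keeping its hidden code sparse, typically via an L1 penalty. A minimal PyTorch sketch of the idea (dimensions, the L1 coefficient, and the random stand-in activations are illustrative, not taken from the article):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Reconstructs activations through an overcomplete, sparsely activated hidden layer."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction of the input activation
        return x_hat, f

# Toy training step: reconstruction loss plus an L1 sparsity penalty on the features.
sae = SparseAutoencoder(d_model=512, d_hidden=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(64, 512)               # stand-in for LLM residual-stream activations
x_hat, f = sae(acts)
loss = ((x_hat - acts) ** 2).mean() + 1e-3 * f.abs().mean()
loss.backward()
opt.step()
```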
A Non-Technical Guide to Interpreting SHAP Analyses (aidancooper.co.uk)
With interpretability becoming an increasingly important requirement for machine learning projects, there's a growing need to communicate the complex outputs of model interpretation techniques to non-technical stakeholders.
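For readers who want to see what those outputs look like in practice, a minimal workflow with the open-source `shap` package is roughly the following; the model and dataset are placeholders chosen for illustration:

```python
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

# Fit any model; SHAP then attributes each prediction to the input features.
X, y = load_diabetes(return_X_y=True, as_frame=True)
model = RandomForestRegressor(n_estimators=100).fit(X, y)

explainer = shap.Explainer(model, X)     # chooses a suitable explainer for the model type
shap_values = explainer(X.iloc[:200])    # per-feature contributions for each prediction

shap.plots.beeswarm(shap_values)         # global summary plot often shown to stakeholders
```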
PiML: Python Interpretable Machine Learning Toolbox (github.com/SelfExplainML)
PiML (or π-ML, /ˈpaɪ·ˈem·ˈel/) is a new Python toolbox for interpretable machine learning model development and validation.
Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE) (arxiv.org)
CLIP embeddings have demonstrated remarkable performance across a wide range of computer vision tasks. However, these high-dimensional, dense vector representations are not easily interpretable, restricting their usefulness in downstream applications that require transparency.
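The core idea is to re-express a dense CLIP embedding as a sparse, nonnegative combination of embeddings of human-readable concepts. The sketch below is not the paper's code; it illustrates that kind of decomposition with scikit-learn's Lasso over a hypothetical concept dictionary of random stand-in vectors:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Columns of D would be normalized CLIP text embeddings of human-readable concepts;
# here they are random stand-ins for illustration.
d_embed, n_concepts = 512, 2000
D = rng.standard_normal((d_embed, n_concepts))
D /= np.linalg.norm(D, axis=0, keepdims=True)

# z plays the role of a dense CLIP image embedding we want to explain.
z = rng.standard_normal(d_embed)
z /= np.linalg.norm(z)

# Find sparse, nonnegative weights w such that D @ w approximates z.
lasso = Lasso(alpha=0.01, positive=True, fit_intercept=False, max_iter=5000)
lasso.fit(D, z)
w = lasso.coef_

active = np.nonzero(w)[0]
print(f"{len(active)} active concepts out of {n_concepts}")
```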
Light Recurrent Unit: An Interpretable RNN for Modeling Long-Range Dependency (mdpi.com)
Steering Characters with Interpretability (dmodel.ai)
A Multimodal Automated Interpretability Agent (arxiv.org)
Logit Prisms: Decomposing Transformer Outputs for Mechanistic Interpretability (neuralblog.github.io)