Circuit Tracing: Revealing Computational Graphs in Language Models (transformer-circuits.pub) We introduce a method to uncover mechanisms underlying behaviors of language models. We produce graph descriptions of the model’s computation on prompts of interest by tracing individual computational steps in a “replacement model”. This replacement model substitutes a more interpretable component (here, a “cross-layer transcoder”) for parts of the underlying model (here, the multi-layer perceptrons) that it is trained to approximate.
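The key move is swapping each MLP for a learned module whose hidden units are sparse, more nameable features, trained to reproduce what the MLP would have computed. Below is a heavily simplified per-layer sketch in PyTorch of that idea, not the paper's cross-layer implementation; the class, dimensions, and loss coefficients are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the paper's code): a per-layer transcoder that reads the
# residual stream going into an MLP and is trained to reproduce that MLP's
# output through a sparse feature layer. The actual method uses a *cross-layer*
# transcoder whose features contribute to the outputs of all later layers.
class Transcoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)   # residual stream -> feature activations
        self.decoder = nn.Linear(n_features, d_model)   # features -> approximated MLP output

    def forward(self, resid_pre: torch.Tensor):
        features = torch.relu(self.encoder(resid_pre))  # sparse after training
        return self.decoder(features), features

def transcoder_loss(mlp_out, pred_out, features, l1_coeff: float = 1e-3):
    # Match the original MLP's output, plus an L1 penalty so that only a few
    # features fire per token and the resulting computational graph stays sparse.
    recon = (pred_out - mlp_out).pow(2).sum(-1).mean()
    sparsity = features.abs().sum(-1).mean()
    return recon + l1_coeff * sparsity
```

Because the transcoder imitates the MLP rather than the residual stream itself, running the model with the MLP swapped out yields a "replacement model" whose feature-to-feature connections can be traced into a graph.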
An Intuitive Explanation of Sparse Autoencoders for LLM Interpretability (adamkarvonen.github.io) Sparse Autoencoders (SAEs) have recently become popular for interpreting machine learning models (although sparse dictionary learning dates back to 1997). Machine learning models and LLMs are becoming more powerful and useful, but they remain black boxes: we don't understand how they accomplish what they do, and it would clearly be useful if we could.
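For a concrete picture of what an SAE does with LLM activations, here is a minimal sketch, assuming activations of shape [batch, d_model] pulled from some hidden layer; the names and hyperparameters are illustrative, not any particular library's API.

```python
import torch
import torch.nn as nn

# Minimal sparse autoencoder sketch. Unlike the transcoder above, an SAE
# reconstructs its own input activation rather than an MLP's output, expanding
# it into an overcomplete set of (hopefully interpretable) features.
class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)   # expand into many candidate features
        self.dec = nn.Linear(d_hidden, d_model)   # reconstruct the original activation

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.enc(x))               # feature activations, mostly zero after training
        return self.dec(f), f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features to zero,
    # so each activation is explained by only a handful of active features.
    return (x_hat - x).pow(2).sum(-1).mean() + l1_coeff * f.abs().sum(-1).mean()
```

The L1 term is what makes the learned dictionary sparse; with it removed, the autoencoder would simply learn a dense rotation of the activation space and nothing would be easier to interpret.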
A Non-Technical Guide to Interpreting SHAP Analyses (aidancooper.co.uk) With interpretability becoming an increasingly important requirement for machine learning projects, there's a growing need to communicate the complex outputs of model interpretation techniques to non-technical stakeholders.
Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE) (arxiv.org) CLIP embeddings have demonstrated remarkable performance across a wide range of computer vision tasks. However, these high-dimensional, dense vector representations are not easily interpretable, restricting their usefulness in downstream applications that require transparency.
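The idea behind SpLiCE is to rewrite a dense CLIP embedding as a sparse, non-negative combination of embeddings of human-readable concepts. A rough sketch of such a decomposition follows; it is not the authors' code, the concept dictionary and regularisation strength are illustrative assumptions, and an off-the-shelf positive Lasso stands in for their solver.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_concept_decomposition(image_emb, concept_dict, concept_names, alpha=0.01):
    """Approximate a dense CLIP image embedding as a sparse, non-negative
    mixture of concept embeddings (e.g. CLIP text embeddings of single words).

    image_emb:     (d,) L2-normalised CLIP image embedding
    concept_dict:  (n_concepts, d) L2-normalised concept embeddings
    concept_names: list of n_concepts human-readable concept strings
    """
    # Solve image_emb ≈ concept_dict.T @ w with w >= 0 and most entries zero.
    lasso = Lasso(alpha=alpha, positive=True, fit_intercept=False, max_iter=10_000)
    lasso.fit(concept_dict.T, image_emb)
    weights = lasso.coef_
    active = np.nonzero(weights)[0]
    # Return the embedding as a short, human-readable list of (concept, weight) pairs.
    return sorted(((concept_names[i], float(weights[i])) for i in active),
                  key=lambda t: -t[1])
```

The output is a handful of weighted concept names per image rather than a 512- or 768-dimensional vector, which is what makes the representation usable in settings that require transparency.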