Automated interpretability seeks to translate large language model (LLM) features into human-understandable descriptions. However, the natural language typically used for such descriptions is often vague and inconsistent, requiring manual relabeling. To address this, we propose semantic regexes: structured-language descriptions of LLM features. By combining primitives that capture linguistic and semantic patterns with modifiers for contextualization, semantic regexes produce precise and expressive feature descriptions. Our quantitative and qualitative evaluations show that semantic regexes match the accuracy of natural language descriptions while being more concise and consistent. Moreover, their structure enables new kinds of analysis, including quantifying feature complexity across layers and scaling automated interpretability from individual features to model-wide patterns. In user studies, semantic regex descriptions helped participants build accurate mental models of LLM feature activations, positioning semantic regexes as a practical tool for automated interpretability.
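To make the idea concrete, a semantic regex can be pictured as a sequence of token-level predicates: linguistic primitives (e.g., part of speech) and semantic primitives (e.g., topic category), combined via modifiers. The sketch below is a minimal illustration under that assumption; the primitive names (`POS`, `TOPIC`, `BOTH`) and the `Token` fields are hypothetical, not the paper's actual grammar.

```python
# Hypothetical sketch of a "semantic regex": a sequence of token-level
# predicates matched against annotated tokens. All names here are
# illustrative assumptions, not the grammar defined in the paper.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Token:
    text: str
    pos: str    # part-of-speech tag (linguistic primitive)
    topic: str  # coarse semantic category (semantic primitive)

Predicate = Callable[[Token], bool]

def POS(tag: str) -> Predicate:
    return lambda t: t.pos == tag

def TOPIC(name: str) -> Predicate:
    return lambda t: t.topic == name

def BOTH(p: Predicate, q: Predicate) -> Predicate:
    # A simple modifier: require both predicates on the same token.
    return lambda t: p(t) and q(t)

def matches(pattern: List[Predicate], tokens: List[Token]) -> bool:
    """True if the pattern matches some contiguous span of tokens."""
    n, m = len(tokens), len(pattern)
    return any(
        all(pred(tok) for pred, tok in zip(pattern, tokens[i:i + m]))
        for i in range(n - m + 1)
    )

# A feature description like "an adjective followed by a finance noun":
pattern = [POS("ADJ"), BOTH(POS("NOUN"), TOPIC("finance"))]
tokens = [Token("rising", "ADJ", "finance"),
          Token("rates", "NOUN", "finance")]
print(matches(pattern, tokens))  # True
```

Because such descriptions are structured, they can be compared, counted, and aggregated programmatically, which is what enables analyses like measuring feature complexity across layers.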
