Friday, February 20, 2026

Unveiling Hidden Biases, Emotions, and Concepts in Large Language Models | MIT News

Researchers from MIT and UC San Diego have developed a technique to identify and manipulate abstract concepts embedded in large language models (LLMs) such as ChatGPT and Claude. The method detects representations of biases, personalities, and moods by targeting specific connections within the model. The team analyzed more than 500 concepts and could amplify or suppress representations such as “conspiracy theorist” or “social influencer” in a model’s responses; for example, they could evoke a conspiracy theorist’s tone when discussing the “Blue Marble” image from Apollo 17. While acknowledging the risks of exposing certain concepts, the researchers argue that their approach illuminates LLM vulnerabilities and can improve safety. To extract the targeted representations efficiently, they used a predictive model called a recursive feature machine (RFM), with the longer-term goal of building specialized LLMs that are both safe and effective. Their findings, published in Science, underscore the complexity of LLMs and their capacity for nuanced, human-like responses.
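To give a flavor of what “amplifying or suppressing a concept representation” can look like in practice, below is a minimal, hypothetical sketch of activation steering. It is not the authors’ RFM pipeline: in place of a recursive feature machine, it estimates a concept direction as the simple difference of mean hidden states between concept-laden and neutral prompts, then shifts one layer’s residual stream along that direction during generation. The model name, layer index, prompts, and scaling factor are all illustrative assumptions.

```python
# Hypothetical sketch of concept steering via activation editing.
# NOT the paper's RFM method: a difference-of-means direction stands in
# for the learned feature extractor. All names below are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # assumption: any causal LM exposing hidden states
LAYER = 6        # assumption: which transformer block to steer
SCALE = 8.0      # assumption: steering strength (negative to suppress)

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def mean_hidden(prompts: list[str], layer: int) -> torch.Tensor:
    """Average one block's output hidden states over all tokens of all prompts."""
    vecs = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[0] is the embeddings, so block `layer` is index layer+1
        vecs.append(out.hidden_states[layer + 1].mean(dim=1).squeeze(0))
    return torch.stack(vecs).mean(dim=0)

# Illustrative prompt sets; the paper analyzed 500+ concepts along these lines.
concept_prompts = [
    "They are hiding the truth from all of us.",
    "The official story is a cover-up by powerful insiders.",
]
neutral_prompts = [
    "The weather today is mild with light clouds.",
    "The train departs from platform two every hour.",
]

direction = mean_hidden(concept_prompts, LAYER) - mean_hidden(neutral_prompts, LAYER)
direction = direction / direction.norm()

def steer(module, inputs, output):
    """Forward hook: shift the residual stream along the concept direction."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * direction.to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(steer)
ids = tok("The Blue Marble photo from Apollo 17 shows", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=40)[0]))
handle.remove()  # restore the unsteered model
```

Flipping the sign of SCALE suppresses the concept instead of amplifying it, which mirrors the enhance-or-reduce control the researchers describe, albeit with a far cruder direction estimate than their RFM provides.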
