The information extraction (IE) protocol comprises four stages: problem definition and data preparation, data preprocessing, LLM-based IE, and output evaluation. It is designed for clinical researchers without NLP expertise and extracts information from medical texts in several formats (PDF, CSV, TXT, Excel). The pipeline is cost-effective and runs on low-resource hardware (e.g., a GPU with 48 GB VRAM). Users preprocess documents through a GUI, applying Optical Character Recognition (OCR) where needed, and then run extraction tasks with current LLMs without complex programming. Once the extraction parameters are defined, the protocol writes its output to CSV and generates confusion matrices for performance evaluation. Its two primary functions are information extraction and document anonymization. The pipeline can be downloaded and set up via Docker or manual installation. The tool is open source and works with a range of models, including Llama, which keeps it accessible and adaptable for processing medical data.
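To make the evaluation stage concrete, the following is a minimal sketch of how a confusion matrix could be computed from a CSV of extracted values compared against manual annotations. It is not the protocol's own code: the function name `confusion_matrix`, the column names `truth` and `predicted`, and the sample rows are all hypothetical and chosen only for illustration.

```python
import csv
import io
from collections import Counter


def confusion_matrix(rows, truth_col, pred_col):
    """Count (truth, prediction) pairs for one extracted variable.

    Returns the sorted label list and a square matrix where
    matrix[i][j] is the number of rows with truth labels[i]
    and prediction labels[j].
    """
    counts = Counter((row[truth_col], row[pred_col]) for row in rows)
    labels = sorted({label for pair in counts for label in pair})
    matrix = [[counts[(t, p)] for p in labels] for t in labels]
    return labels, matrix


# Hypothetical data: extraction output joined with manual annotations.
# The real pipeline's CSV layout may differ.
SAMPLE_CSV = """report_id,truth,predicted
r1,yes,yes
r2,no,no
r3,yes,no
r4,no,no
r5,yes,yes
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
labels, matrix = confusion_matrix(rows, "truth", "predicted")

# Accuracy is the share of rows on the matrix diagonal.
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(rows)

print(labels)
for label, row in zip(labels, matrix):
    print(label, row)
print(f"accuracy = {accuracy:.2f}")
```

Rows of the matrix correspond to the annotated label and columns to the extracted label, so off-diagonal cells show which values the model confuses. The same function could be applied per extracted variable when a CSV holds several.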