Every day, German courts publish insolvency announcements on insolvenzbekanntmachungen.de — the official portal for bankruptcy proceedings across all 16 federal states. These announcements contain structured legal data about companies and individuals entering insolvency, including court details, creditor meetings, administrator appointments, and case timelines. But the portal itself is designed for manual lookup, not systematic analysis. Inso Crawler is a Python-based intelligence platform that crawls, parses, classifies, and enriches German insolvency data using a combination of traditional parsing, LLM-based extraction (Gemini 2.0 Flash), and machine learning classification. It turns raw legal announcements into structured, searchable business intelligence.
The Problem
German insolvency data is public but practically inaccessible at scale. The official portal at neu.insolvenzbekanntmachungen.de uses a JSF (JavaServer Faces) web application with server-side state management, JSESSIONID cookies, and ViewState tokens — making it hostile to automated access. The search interface requires manual selection of federal states, date ranges, and case types. Results are displayed as pipe-delimited text blocks buried in HTML, with no API, no export function, and no way to perform cross-regional analysis.
For credit analysts, debt collection firms, market researchers, and risk management teams, this means that monitoring German insolvency trends requires either expensive commercial data providers or tedious manual searches across all 16 federal states, every single day.
The official insolvency portal publishes data for all of Germany, but the interface was designed for a single court clerk looking up a single case — not for anyone who needs to understand the bigger picture.
The Solution
Inso Crawler is a multi-layered data intelligence system that solves this problem through four coordinated components: a stateful HTTP crawler that navigates the JSF application, a dual parsing engine (regex + LLM), a business classification enrichment pipeline, and a web UI for browsing and analyzing the collected data.
Key Features
Stateful JSF Crawler
The crawler communicates natively with the official portal via HTTP — no browser automation required. It maintains JSF session state by tracking JSESSIONID cookies and ViewState tokens, navigating the application’s form submissions and postbacks programmatically. It systematically iterates across all 16 federal states and the full range of case categories, including:
- Sicherungsmaßnahmen (Precautionary measures)
- Abweisungen mangels Masse (Rejections for lack of assets)
- Eröffnungen (Case openings)
- Entscheidungen im Verfahren (Procedural decisions)
- Verteilungsverzeichnisse (Distribution schedules)
- Restschuldbefreiung (Residual debt discharge)
- Überwachte Insolvenzpläne (Supervised insolvency plans)
Dual Parsing Engine
Each announcement is processed by two complementary parsers:
Regex Parser (parser.py): A robust traditional parser that extracts structured fields from pipe-delimited announcement text using carefully crafted regular expressions. It extracts:
- Case reference numbers (Aktenzeichen)
- Court and publication dates
- Debtor information (name, type, address, registration details)
- Administrator details (name, title, role, contact information)
- Hearing dates and locations
- Filing deadlines and claim registration periods
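A reduced sketch of this style of extraction, assuming a pipe-delimited announcement line; the patterns and field names below are illustrative, and the real parser.py uses its own, more defensive expressions.

```python
import re

# Illustrative patterns for two of the fields listed above.
AKTENZEICHEN_RE = re.compile(r"\b((?:\d+\w*\s+)?IN\s+\d+/\d+)")
DATUM_RE = re.compile(r"\b(\d{2}\.\d{2}\.\d{4})\b")

def parse_announcement(text: str) -> dict:
    """Extract a few structured fields from pipe-delimited announcement text."""
    az = AKTENZEICHEN_RE.search(text)
    datum = DATUM_RE.search(text)
    return {
        "aktenzeichen": az.group(1) if az else None,
        "datum": datum.group(1) if datum else None,
    }

sample = "Amtsgericht Charlottenburg | 36a IN 581/25 | 14.03.2025 | Eröffnung"
```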
LLM Parser (llm_parser.py): A Gemini 2.0 Flash-based extraction system that uses function calling to produce schema-constrained, structured JSON output. The LLM parser handles edge cases, unusual formatting, and ambiguous text that the regex parser might miss. It extracts the same fields, but with the added ability to interpret natural-language context in German legal text.
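Function calling works by registering a JSON-schema-style function declaration with the model, which then answers with structured arguments instead of free text. The declaration below is a hypothetical reduction of what llm_parser.py might register (the tool name and field set are assumptions, not the project's actual schema).

```python
# A function declaration in the JSON-schema style accepted by Gemini's
# function-calling interface. It would be passed to the model as a tool
# (e.g. tools=[...]) so responses arrive as structured arguments.
EXTRACT_DECLARATION = {
    "name": "record_insolvency_fields",  # hypothetical tool name
    "description": "Record structured fields from a German insolvency announcement.",
    "parameters": {
        "type": "object",
        "properties": {
            "aktenzeichen": {"type": "string", "description": "Case reference number"},
            "gericht": {"type": "string", "description": "Court name"},
            "schuldner_name": {"type": "string", "description": "Debtor name"},
            "schuldner_typ": {
                "type": "string",
                "enum": ["natuerliche Person", "juristische Person"],
            },
        },
        "required": ["aktenzeichen", "gericht", "schuldner_name"],
    },
}
```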
Business Classification Enrichment
For corporate insolvencies, the system enriches records with Wirtschaftszweig (WZ) industry classification codes. When the announcement text does not contain business sector information, the wz_lookup.py module queries the Serper Places API (Google Maps/MyBusiness data) to identify the company’s industry based on its name and location. This enrichment transforms raw legal data into business intelligence.
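A sketch of how such a lookup request could be assembled. The endpoint URL, header name, and query parameters reflect Serper's documented conventions but are assumptions here, as is the helper name; the step that maps the returned business category onto a WZ code is project-specific and only hinted at.

```python
import json

SERPER_PLACES_URL = "https://google.serper.dev/places"  # assumed endpoint

def build_places_query(company: str, city: str) -> tuple[dict, dict]:
    """Build the payload and headers for a hypothetical Serper Places lookup.

    The WZ code would then be derived from the business category in the
    response; that mapping lives in wz_lookup.py and is not reproduced here.
    """
    payload = {"q": f"{company} {city}", "gl": "de", "hl": "de"}
    headers = {
        "X-API-KEY": "YOUR_SERPER_KEY",  # placeholder; never hard-code real keys
        "Content-Type": "application/json",
    }
    return payload, headers

payload, headers = build_places_query("Musterbau GmbH", "Berlin")
body = json.dumps(payload)  # what an HTTP client would POST
```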
Company Detection Heuristics
The crawler automatically distinguishes between personal and corporate insolvencies using keyword-based heuristics that detect legal entity suffixes:
```python
_FIRMA_KEYWORDS = [
    "gmbh", "ag", "ug", "e.k.", "ohg", "kg", "gbr", "ltd", "se",
    "stiftung", "verein", "genossenschaft", "partg", "eg",
    "gesellschaft mit beschränkter haftung",
]
```
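One plausible shape for such a heuristic, with the keyword list repeated so the sketch is self-contained; the project's actual implementation may differ. Matching whole tokens rather than raw substrings matters here, so that e.g. the "ug" inside "Zug" does not trigger a false positive.

```python
_FIRMA_KEYWORDS = [
    "gmbh", "ag", "ug", "e.k.", "ohg", "kg", "gbr", "ltd", "se",
    "stiftung", "verein", "genossenschaft", "partg", "eg",
    "gesellschaft mit beschränkter haftung",
]

def is_firma(name: str) -> bool:
    """Return True if the debtor name looks like a legal entity."""
    lowered = name.lower()
    # Multi-word keywords are matched as substrings...
    if "gesellschaft mit beschränkter haftung" in lowered:
        return True
    # ...single-token keywords only as whole words.
    tokens = lowered.replace(",", " ").split()
    return any(tok.strip("()") in _FIRMA_KEYWORDS for tok in tokens)
```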
Technology Stack
| Layer | Technology | Purpose |
|---|---|---|
| Crawler | Python 3.12, Requests | HTTP-based JSF navigation |
| Traditional Parsing | Regex, dataclasses | Structured field extraction |
| LLM Parsing | Google Gemini 2.0 Flash | Intelligent text extraction |
| NLP | spaCy (de_core_news_lg) | German named entity recognition |
| ML Classification | scikit-learn, PyTorch | Document and entity classification |
| Data Analysis | NumPy, Pandas, Matplotlib | Statistical analysis and visualization |
| Graph Analysis | NetworkX | Relationship mapping between entities |
| NLP Embeddings | sentence-transformers | Semantic similarity and search |
| Web UI | FastAPI, Uvicorn | Web interface (port 8420) |
| Enrichment | Serper Places API | WZ industry code lookup |
| Storage | SQLite (sqlitedict), JSON caching | Persistent data storage |
Architecture
The system operates as a pipeline with four stages, each building on the output of the previous one.
Stage 1: Crawl — The InsolvenzCrawler class initiates an HTTP session with the official portal, obtains a JSESSIONID, and systematically queries each combination of federal state (16) and case category (8+). It maintains ViewState tokens and handles JSF form submissions to navigate pagination. A configurable delay (default: 1 second) prevents overwhelming the server. Results are cached locally to enable incremental updates.
Stage 2: Parse — Each announcement passes through both the regex parser and the LLM parser. The regex parser provides fast, deterministic extraction for well-formatted announcements. The LLM parser (Gemini 2.0 Flash with function calling) handles edge cases and extracts fields that require understanding German legal context. The dual approach provides both speed and accuracy.
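One plausible way to combine the two parser outputs, sketched under the assumption that the deterministic regex value wins where present and the LLM value fills the gaps; the project's actual merge logic may weigh the sources differently.

```python
def merge_parses(regex_result: dict, llm_result: dict) -> dict:
    """Combine regex and LLM extractions field by field."""
    merged = dict(llm_result)      # start with LLM values as fallbacks
    for field, value in regex_result.items():
        if value is not None:      # a regex hit is deterministic: trust it
            merged[field] = value
    return merged

regex_out = {"aktenzeichen": "IN 581/25", "gericht": None}
llm_out = {"aktenzeichen": "IN 581/25", "gericht": "AG Charlottenburg",
           "schuldner_name": "Musterbau GmbH"}
```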
Stage 3: Enrich — Corporate insolvencies are enriched with industry classification data. The is_firma() heuristic identifies companies, and the WZ lookup service queries external APIs to determine the business sector. This transforms a legal record into a business intelligence record.
Stage 4: Analyze and Serve — The enriched data is stored in SQLite and served through a FastAPI web application on port 8420. The ML stack (scikit-learn, PyTorch, sentence-transformers) enables classification, similarity search, and trend analysis. NetworkX builds relationship graphs between entities (companies, administrators, courts), and Pandas/Matplotlib generate statistical reports.
```
insolvenzbekanntmachungen.de (JSF Application)
        |
[Stateful HTTP Crawler]
    JSESSIONID + ViewState tracking
    16 States x 8+ Categories
        |
[Dual Parsing Engine]
    +-- Regex Parser (fast, deterministic)
    +-- LLM Parser (Gemini 2.0 Flash, function calling)
        |
[Enrichment Pipeline]
    +-- Company Detection (keyword heuristics)
    +-- WZ Industry Lookup (Serper Places API)
        |
[Storage + Analysis]
    +-- SQLite / JSON cache
    +-- ML Classification (scikit-learn, PyTorch)
    +-- Graph Analysis (NetworkX)
    +-- Web UI (FastAPI, port 8420)
```
Data Model
Each insolvency record is structured as a rich data object:
```
InsolvenzEintrag:
    aktenzeichen              # Case reference (e.g., "IN 581/25")
    veroeffentlichungsdatum   # Publication date
    gericht                   # Court name
    sitz                      # Court location
    schuldner_name            # Debtor name (person or company)
    schuldner_typ             # "natuerliche Person" or "juristische Person"
    geburtsdatum              # Date of birth (persons only)
    schuldner_adresse         # Debtor address
    registergericht           # Commercial register court
    register_nr               # Commercial register number
    geschaeftszweig           # Business sector (WZ code)
    verwalter                 # Insolvency administrator details
    termine                   # Hearing dates and locations
```
Coverage
The crawler covers the complete German insolvency landscape:
| Dimension | Coverage |
|---|---|
| Federal States | All 16 (Baden-Württemberg through Thüringen) |
| Case Categories | 8+ types (openings, measures, distributions, etc.) |
| Entity Types | Natural persons and legal entities |
| Historical Data | Configurable date range backfill |
| Update Frequency | Daily incremental crawls |
Looking Ahead
Inso Crawler fills a critical gap in the German business intelligence landscape. By making public insolvency data programmatically accessible and enriching it with ML-powered classification and industry data, it turns a bureaucratic portal into a strategic resource. Whether you need to monitor competitors, assess credit risk, identify market opportunities in distressed assets, or track insolvency trends across industries and regions, this platform provides the data foundation that commercial providers charge thousands of euros per month to deliver — built on open-source tools and publicly available data.