Every day, German courts publish insolvency announcements on insolvenzbekanntmachungen.de — the official portal for bankruptcy proceedings across all 16 federal states. These announcements contain structured legal data about companies and individuals entering insolvency, including court details, creditor meetings, administrator appointments, and case timelines. But the portal itself is designed for manual lookup, not systematic analysis. Inso Crawler is a Python-based intelligence platform that crawls, parses, classifies, and enriches German insolvency data, combining traditional parsing, LLM-based extraction (Gemini 2.0 Flash), and machine learning classification. It turns raw legal announcements into structured, searchable business intelligence.

The Problem

German insolvency data is public but practically inaccessible at scale. The official portal at neu.insolvenzbekanntmachungen.de uses a JSF (JavaServer Faces) web application with server-side state management, JSESSIONID cookies, and ViewState tokens — making it hostile to automated access. The search interface requires manual selection of federal states, date ranges, and case types. Results are displayed as pipe-delimited text blocks buried in HTML, with no API, no export function, and no way to perform cross-regional analysis.

For credit analysts, debt collection firms, market researchers, and risk management teams, this means that monitoring German insolvency trends requires either expensive commercial data providers or tedious manual searches across all 16 federal states, every single day.

The official insolvency portal publishes data for all of Germany, but the interface was designed for a single court clerk looking up a single case — not for anyone who needs to understand the bigger picture.

The Solution

Inso Crawler is a multi-layered data intelligence system that solves this problem through four coordinated components: a stateful HTTP crawler that navigates the JSF application, a dual parsing engine (regex + LLM), a business classification enrichment pipeline, and a web UI for browsing and analyzing the collected data.

Key Features

Stateful JSF Crawler

The crawler communicates natively with the official portal via HTTP — no browser automation required. It maintains JSF session state by tracking JSESSIONID cookies and ViewState tokens, navigating the application’s form submissions and postbacks programmatically. It systematically iterates across all 16 federal states and every case category, including:

  • Sicherungsmaßnahmen (Precautionary measures)
  • Abweisungen mangels Masse (Rejections for lack of assets)
  • Eröffnungen (Case openings)
  • Entscheidungen im Verfahren (Procedural decisions)
  • Verteilungsverzeichnisse (Distribution schedules)
  • Restschuldbefreiung (Residual debt discharge)
  • Überwachte Insolvenzpläne (Supervised insolvency plans)
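
The session bootstrap behind this can be sketched roughly as follows. This is a minimal illustration, assuming the ViewState token appears as a hidden form field in the rendered page; the real crawler's endpoints, form names, and postback handling may differ.

```python
import re
from typing import Tuple

import requests

PORTAL_URL = "https://neu.insolvenzbekanntmachungen.de"  # search page path omitted

# JSF typically renders the ViewState as a hidden input; this pattern is an
# assumption about the markup, not a guaranteed contract.
VIEWSTATE_RE = re.compile(r'name="javax\.faces\.ViewState"[^>]*value="([^"]+)"')

def extract_viewstate(html: str) -> str:
    """Pull the JSF ViewState token out of a rendered page."""
    match = VIEWSTATE_RE.search(html)
    if not match:
        raise ValueError("no ViewState token found")
    return match.group(1)

def open_session() -> Tuple[requests.Session, str]:
    """GET the portal once: the Session keeps the JSESSIONID cookie, and the
    extracted token must be echoed back on every subsequent JSF postback."""
    session = requests.Session()
    response = session.get(PORTAL_URL, timeout=30)
    response.raise_for_status()
    return session, extract_viewstate(response.text)
```

Each subsequent search or pagination request then POSTs the form fields plus the current ViewState through the same session, updating the token from every response.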

Dual Parsing Engine

Each announcement is processed by two complementary parsers:

Regex Parser (parser.py): A robust traditional parser that extracts structured fields from pipe-delimited announcement text using carefully crafted regular expressions. It extracts:

  • Case reference numbers (Aktenzeichen)
  • Court and publication dates
  • Debtor information (name, type, address, registration details)
  • Administrator details (name, title, role, contact information)
  • Hearing dates and locations
  • Filing deadlines and claim registration periods

LLM Parser (llm_parser.py): A Gemini 2.0 Flash-based extraction system that uses function calling for guaranteed structured JSON output. The LLM parser handles edge cases, unusual formatting, and ambiguous text that the regex parser might miss. It extracts the same fields but with the added intelligence of understanding natural language context in German legal text.
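
As an illustration of the function-calling setup, a declaration along these lines would be registered with the Gemini client as a tool, with tool calling forced so the model must respond with the structured record rather than free text. The field set here is abbreviated and the exact schema in llm_parser.py is an assumption.

```python
# Abbreviated function declaration for forced structured output; field names
# mirror the data model, but llm_parser.py's actual schema is richer.
EXTRACT_DECLARATION = {
    "name": "extract_insolvency_record",
    "description": "Extract structured fields from a German insolvency announcement.",
    "parameters": {
        "type": "object",
        "properties": {
            "aktenzeichen": {"type": "string"},
            "gericht": {"type": "string"},
            "schuldner_name": {"type": "string"},
            "schuldner_typ": {
                "type": "string",
                "enum": ["natuerliche Person", "juristische Person"],
            },
        },
        "required": ["aktenzeichen", "schuldner_name"],
    },
}
```

Because the model is constrained to call this function, the output is guaranteed to be parseable JSON matching the schema, which is what makes the LLM path safe to merge with the regex path.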

Business Classification Enrichment

For corporate insolvencies, the system enriches records with Wirtschaftszweig (WZ) industry classification codes. When the announcement text does not contain business sector information, the wz_lookup.py module queries the Serper Places API (Google Maps/MyBusiness data) to identify the company’s industry based on its name and location. This enrichment transforms raw legal data into business intelligence.
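
A sketch of such a lookup is shown below, with the HTTP call injectable so it can be stubbed in tests. The response shape (a "places" list with a "category" field) follows Serper's documented format; the mapping from that category string to a WZ code, which wz_lookup.py would apply, is not shown here.

```python
import json
import urllib.request
from typing import Optional

SERPER_PLACES_URL = "https://google.serper.dev/places"

def lookup_sector(name: str, city: str, api_key: str, fetch=None) -> Optional[str]:
    """Query Serper Places for a company and return its primary category.
    `fetch` is injectable so the network call can be replaced by a stub."""
    payload = {"q": f"{name} {city}", "gl": "de", "hl": "de"}
    if fetch is None:
        def fetch(body):  # real HTTP path
            req = urllib.request.Request(
                SERPER_PLACES_URL,
                data=json.dumps(body).encode(),
                headers={"X-API-KEY": api_key, "Content-Type": "application/json"},
            )
            with urllib.request.urlopen(req) as resp:
                return json.load(resp)
    data = fetch(payload)
    places = data.get("places") or []
    # The first result's category (e.g. "Bauunternehmen") is what would then
    # be mapped onto a WZ industry code.
    return places[0].get("category") if places else None
```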

Company Detection Heuristics

The crawler automatically distinguishes between personal and corporate insolvencies using keyword-based heuristics that detect legal entity suffixes:

_FIRMA_KEYWORDS = [
    "gmbh", "ag", "ug", "e.k.", "ohg", "kg", "gbr", "ltd", "se",
    "stiftung", "verein", "genossenschaft", "partg", "eg",
    "gesellschaft mit beschränkter haftung",
]
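
A matching predicate over that list might look like the following sketch. The keyword list is repeated so the snippet runs standalone; the project's actual is_firma() may use additional signals.

```python
# Repeated from the keyword list above so this sketch is self-contained.
_FIRMA_KEYWORDS = [
    "gmbh", "ag", "ug", "e.k.", "ohg", "kg", "gbr", "ltd", "se",
    "stiftung", "verein", "genossenschaft", "partg", "eg",
    "gesellschaft mit beschränkter haftung",
]

def is_firma(schuldner_name: str) -> bool:
    """Heuristic: treat the debtor as a company if a legal-form keyword
    appears in the name."""
    name = schuldner_name.lower()
    tokens = name.replace(",", " ").split()
    for kw in _FIRMA_KEYWORDS:
        if " " in kw:
            if kw in name:  # multi-word legal forms match as substrings
                return True
        elif kw in tokens:  # short suffixes must be whole tokens ("ag", "kg")
            return True
    return False
```

Requiring whole-token matches for the short suffixes avoids false positives such as "ag" inside an ordinary surname.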

Technology Stack

Layer                Technology                        Purpose
Crawler              Python 3.12, Requests             HTTP-based JSF navigation
Traditional Parsing  Regex, dataclasses                Structured field extraction
LLM Parsing          Google Gemini 2.0 Flash           Intelligent text extraction
NLP                  spaCy (de_core_news_lg)           German named entity recognition
ML Classification    scikit-learn, PyTorch             Document and entity classification
Data Analysis        NumPy, Pandas, Matplotlib         Statistical analysis and visualization
Graph Analysis       NetworkX                          Relationship mapping between entities
NLP Embeddings       sentence-transformers             Semantic similarity and search
Web UI               FastAPI, Uvicorn                  Web interface (port 8420)
Enrichment           Serper Places API                 WZ industry code lookup
Storage              SQLite (sqlitedict), JSON cache   Persistent data storage

Architecture

The system operates as a pipeline with four stages, each building on the output of the previous one.

Stage 1: Crawl — The InsolvenzCrawler class initiates an HTTP session with the official portal, obtains a JSESSIONID, and systematically queries each combination of federal state (16) and case category (8+). It maintains ViewState tokens and handles JSF form submissions to navigate pagination. A configurable delay (default: 1 second) prevents overwhelming the server. Results are cached locally to enable incremental updates.
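
The incremental-update idea can be sketched with a content-hash cache. Plain JSON is used here to keep the illustration dependency-free; the project itself stores data via sqlitedict.

```python
import hashlib
import json
from pathlib import Path

class CrawlCache:
    """Remembers announcements already seen, so a daily crawl only
    processes genuinely new ones."""

    def __init__(self, path: str = "crawl_cache.json"):
        self.path = Path(path)
        self.seen = set(json.loads(self.path.read_text())) if self.path.exists() else set()

    @staticmethod
    def key(announcement_text: str) -> str:
        # Hash the raw text: announcements have no stable ID in the portal output.
        return hashlib.sha256(announcement_text.encode()).hexdigest()

    def is_new(self, announcement_text: str) -> bool:
        return self.key(announcement_text) not in self.seen

    def add(self, announcement_text: str) -> None:
        self.seen.add(self.key(announcement_text))
        self.path.write_text(json.dumps(sorted(self.seen)))
```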

Stage 2: Parse — Each announcement passes through both the regex parser and the LLM parser. The regex parser provides fast, deterministic extraction for well-formatted announcements. The LLM parser (Gemini 2.0 Flash with function calling) handles edge cases and extracts fields that require understanding German legal context. The dual approach provides both speed and accuracy.
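
The combination of the two parsers can be sketched as a field-level merge: deterministic regex values win where present, and the LLM result fills the gaps. The actual precedence rules in the project may differ.

```python
def merge_parses(regex_fields: dict, llm_fields: dict) -> dict:
    """Prefer deterministic regex values field-by-field; fall back to the
    LLM extraction wherever the regex parser came up empty."""
    merged = dict(llm_fields)          # LLM output as the baseline
    for key, value in regex_fields.items():
        if value not in (None, ""):    # deterministic values take precedence
            merged[key] = value
    return merged
```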

Stage 3: Enrich — Corporate insolvencies are enriched with industry classification data. The is_firma() heuristic identifies companies, and the WZ lookup service queries external APIs to determine the business sector. This transforms a legal record into a business intelligence record.

Stage 4: Analyze and Serve — The enriched data is stored in SQLite and served through a FastAPI web application on port 8420. The ML stack (scikit-learn, PyTorch, sentence-transformers) enables classification, similarity search, and trend analysis. NetworkX builds relationship graphs between entities (companies, administrators, courts), and Pandas/Matplotlib generate statistical reports.
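
For the graph stage, the edge-building step might look like this stdlib-only sketch; the resulting weighted edges can be handed directly to networkx.Graph.add_weighted_edges_from(). Record keys follow the data model above, while the aggregation itself is an illustration rather than the project's exact code.

```python
from collections import Counter
from typing import Dict, List, Tuple

def administrator_court_edges(records: List[Dict]) -> List[Tuple[str, str, int]]:
    """Count administrator/court co-occurrences across records and emit
    weighted edges (administrator, court, case_count)."""
    counts = Counter(
        (r["verwalter"], r["gericht"])
        for r in records
        if r.get("verwalter") and r.get("gericht")
    )
    return [(adm, court, n) for (adm, court), n in counts.items()]
```

The edge weights make repeat appointments visible, e.g. which administrators a given court appoints most often.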

insolvenzbekanntmachungen.de (JSF Application)
          |
    [Stateful HTTP Crawler]
    JSESSIONID + ViewState tracking
    16 States x 8+ Categories
          |
    [Dual Parsing Engine]
    +-- Regex Parser (fast, deterministic)
    +-- LLM Parser (Gemini 2.0 Flash, function calling)
          |
    [Enrichment Pipeline]
    +-- Company Detection (keyword heuristics)
    +-- WZ Industry Lookup (Serper Places API)
          |
    [Storage + Analysis]
    +-- SQLite / JSON cache
    +-- ML Classification (scikit-learn, PyTorch)
    +-- Graph Analysis (NetworkX)
    +-- Web UI (FastAPI, port 8420)

Data Model

Each insolvency record is structured as a rich data object:

@dataclass
class InsolvenzEintrag:
    # Field types are indicative; optional fields depend on entity type.
    aktenzeichen: str                 # Case reference (e.g., "IN 581/25")
    veroeffentlichungsdatum: str      # Publication date
    gericht: str                      # Court name
    sitz: str                         # Court location
    schuldner_name: str               # Debtor name (person or company)
    schuldner_typ: str                # "natuerliche Person" or "juristische Person"
    geburtsdatum: str | None          # Date of birth (persons only)
    schuldner_adresse: str | None     # Debtor address
    registergericht: str | None       # Commercial register court
    register_nr: str | None           # Commercial register number
    geschaeftszweig: str | None       # Business sector (WZ code)
    verwalter: dict | None            # Insolvency administrator details
    termine: list | None              # Hearing dates and locations

Coverage

The crawler covers the complete German insolvency landscape:

Dimension         Coverage
Federal States    All 16 (Baden-Wuerttemberg through Thueringen)
Case Categories   8+ types (openings, measures, distributions, etc.)
Entity Types      Natural persons and legal entities
Historical Data   Configurable date range backfill
Update Frequency  Daily incremental crawls

Looking Ahead

Inso Crawler fills a critical gap in the German business intelligence landscape. By making public insolvency data programmatically accessible and enriching it with ML-powered classification and industry data, it turns a bureaucratic portal into a strategic resource. Whether you need to monitor competitors, assess credit risk, identify market opportunities in distressed assets, or track insolvency trends across industries and regions, this platform provides the data foundation that commercial providers charge thousands of euros per month to deliver — built on open-source tools and publicly available data.