Company registration data is the backbone of business intelligence, compliance, and due diligence in the DACH region. Yet accessing this data programmatically from the official German Handelsregister and Austrian ZVR (Zentrales Vereinsregister) remains a significant technical challenge. The Vereinsregister Universal Crawler is a web crawler with a browser-based control interface that automates the extraction of company registration documents, shareholder lists, and structured business data at scale.

The Problem

The German Handelsregister (commercial register) and Austrian ZVR contain critical information about every registered company: founding documents, current officers, registered addresses, financial filings, shareholder lists, and chronological change histories. This data is essential for:

  • Due diligence — Verifying company details before business relationships
  • Compliance (KYC/AML) — Know Your Customer and Anti-Money Laundering checks
  • Market research — Analyzing company formations, dissolutions, and ownership structures
  • Legal proceedings — Obtaining official company documents for court cases

However, the official register portals are designed for manual, one-at-a-time lookups. They offer no bulk download API, impose session-based access controls, and present data through complex multi-step web interfaces. For anyone needing data on hundreds or thousands of companies, manual extraction is simply not feasible.

"We needed shareholder lists for 5,000 companies across 20 German courts. Manual download would have taken months. The crawler did it in days."

The Solution

The Vereinsregister Universal Crawler is a Node.js/TypeScript application that automates the entire document retrieval process. It navigates the register portals programmatically, handles session management and pagination, and downloads documents in parallel across multiple courts. A modern single-page web UI provides real-time monitoring, analytics, and control over the crawling process.

Key Features

  • Multi-Country Support — Crawl both German Handelsregister and Austrian ZVR registers
  • Multi-Court Parallel Processing — Crawl up to 5 courts simultaneously, each with its own worker pool
  • 7 Document Types — Download SI (structured data), AD (current printout), CD (chronological history), HD (historical document), DK (document archive), UT (entity carrier), and GL (shareholder lists)
  • Gesellschafterliste (Shareholder List) Support — Specialized extraction of shareholder lists via DK document tree navigation
  • Court Explorer — Browse courts by German state, view register number ranges, and queue multiple courts for bulk processing
  • Document Browser — Search and browse downloaded documents with filters and pagination
  • Company Search — Full-text search across all collected company data
  • On-Demand Retrieval — Retrieve specific documents by court and register number without running a full crawl
  • Real-Time Monitoring — Live progress tracking, log streaming, and run history with statistics
  • Analytics Dashboard — Visualize document coverage, crawl performance, and trends with Chart.js
  • Proxy Rotation — Built-in proxy support with health tracking and success rate monitoring
  • Notifications — Optional Discord and Slack webhooks for crawl status updates

Technology Stack

Component | Technology | Purpose
Runtime | Node.js + TypeScript | High-performance async I/O with type safety
Web Server | Express | REST API and static file serving
Database | SQLite (better-sqlite3, WAL mode) | High-performance persistent storage
Frontend | Vanilla JS SPA | Single-page application with multiple views
Charts | Chart.js | Analytics and performance visualizations
Parsers | XJustiz, Austria, CD parsers | Document-specific XML and data parsing
Proxy | Custom ProxyManager | Rotation, health tracking, success rates
Notifications | Discord/Slack webhooks | Crawl status alerts

Architecture

The crawler follows a service-oriented architecture with clear separation between crawling, parsing, storage, and presentation concerns.

Crawl Orchestration Layer

The CrawlerManager coordinates multi-court parallel processing. When a bulk crawl is initiated, it distributes work across up to 5 concurrent court crawlers. Each court crawler is managed by the CrawlerService, which handles the actual scraping logic, including session management, pagination, and retry handling.
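
The concurrency cap described above can be illustrated with a small worker-lane pattern. This is a minimal sketch, not the CrawlerManager's actual code: runWithLimit, CourtJob, and crawlCourt are hypothetical names, and the crawl itself is replaced by a short timer so that only the scheduling behaviour is shown.

```typescript
// Hypothetical sketch: names and signatures are illustrative, not the crawler's actual API.
type CourtJob = { courtId: string };

// Runs `worker` over all items with at most `limit` calls in flight at once.
// Results are returned in the same order as the input items.
async function runWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each lane claims the next unprocessed index until the queue is empty.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index]);
    }
  }

  const laneCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}

const MAX_CONCURRENT_COURTS = 5; // the limit stated in the text

// Stand-in for a real court crawl; it only records how many crawls overlap.
let active = 0;
let peak = 0;
async function crawlCourt(job: CourtJob): Promise<string> {
  active++;
  peak = Math.max(peak, active);
  await new Promise<void>((resolve) => setTimeout(resolve, 10));
  active--;
  return `done:${job.courtId}`;
}
```

Calling runWithLimit(jobs, MAX_CONCURRENT_COURTS, crawlCourt) with more than five jobs keeps five crawls in flight and starts the next one as soon as a lane frees up. A fixed number of lanes was chosen here over batching in groups of five because a single slow court then does not hold up the other four slots.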

Scraper Layer

The HandelsregisterScraper implements the protocol for interacting with the German register portal. It handles the multi-step document retrieval process: searching by register number, navigating document trees, and downloading individual documents. For shareholder lists (GL), it performs specialized DK tree navigation to locate and extract the most recent Gesellschafterliste.
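
As an illustration of the DK tree navigation, the sketch below walks a document tree and picks the newest shareholder list. The node shape (DkNode), the label match on "Gesellschafterliste", and the ISO date field are all assumptions made for this example; the portal's real tree structure and labels may differ.

```typescript
// Hypothetical shape of a DK document tree node; the real portal's structure may differ.
interface DkNode {
  label: string;
  date?: string; // assumed: ISO date (YYYY-MM-DD) on leaf documents
  children?: DkNode[];
}

// Depth-first walk that returns the most recent leaf labelled as a shareholder list.
function findLatestShareholderList(root: DkNode): DkNode | undefined {
  let latest: DkNode | undefined;
  const stack: DkNode[] = [root];

  while (stack.length > 0) {
    const node = stack.pop()!;
    const isLeaf = !node.children || node.children.length === 0;

    if (isLeaf && node.date && /gesellschafterliste/i.test(node.label)) {
      // ISO dates sort correctly when compared as plain strings.
      if (!latest || node.date > latest.date!) latest = node;
    }
    if (node.children) stack.push(...node.children);
  }
  return latest;
}

// Example tree with two shareholder lists in different folders.
const exampleTree: DkNode = {
  label: "Dokumente zum Rechtsträger",
  children: [
    { label: "Gesellschaftsvertrag", date: "2015-03-01" },
    {
      label: "Listen",
      children: [
        { label: "Gesellschafterliste", date: "2018-06-12" },
        { label: "Gesellschafterliste", date: "2022-11-30" },
      ],
    },
  ],
};
```

With this example tree, findLatestShareholderList(exampleTree) returns the node dated 2022-11-30, and it returns undefined for a tree that contains no shareholder list.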

Parser Layer

Three specialized parsers handle different document formats:

  • XJustizParser — Parses German XJustiz XML structured data (SI documents)
  • AustriaParser — Handles Austrian register document formats
  • CDParser — Processes chronological document histories
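
One way to picture how documents reach these parsers is a small routing function. The text names the three parsers but does not say how documents are assigned to them, so the rules below (all Austrian documents to AustriaParser, SI to XJustizParser, CD to CDParser, everything else stored unparsed) are assumptions made for illustration.

```typescript
// Hypothetical routing rules; the crawler's actual dispatch logic is not described in the text.
type Country = "DE" | "AT";
type ParserName = "XJustizParser" | "AustriaParser" | "CDParser";

function selectParser(country: Country, docType: string): ParserName | undefined {
  // Assumed: every Austrian document goes through the Austria-specific parser.
  if (country === "AT") return "AustriaParser";

  switch (docType) {
    case "SI":
      return "XJustizParser"; // XJustiz XML structured data
    case "CD":
      return "CDParser"; // chronological document history
    default:
      return undefined; // assumed: other document types are stored without parsing
  }
}
```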

Storage Layer

SQLite runs in WAL (write-ahead logging) mode, so readers such as the web UI and the search API are not blocked while an active crawl is writing. The Database service manages all data persistence, while RunLogger tracks crawl runs with detailed statistics for historical analysis.

Proxy Management

The ProxyManager implements proxy rotation with health tracking. Each proxy’s success rate is monitored, and unhealthy proxies are automatically removed from the rotation pool. This helps keep long crawls running, including multi-day operations, when individual proxies fail.
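
The rotation and health-tracking idea can be sketched as a round-robin pool that skips proxies whose success rate drops too low. This is not the actual ProxyManager: the class name RotatingProxyPool, the 50% threshold, and the minimum sample count are assumptions chosen for the example.

```typescript
// Illustrative sketch only; thresholds and names are assumptions, not the crawler's ProxyManager.
interface ProxyStats {
  url: string;
  successes: number;
  failures: number;
}

class RotatingProxyPool {
  private proxies: ProxyStats[];
  private cursor = 0;
  private minSuccessRate: number;
  private minSamples: number;

  constructor(urls: string[], minSuccessRate = 0.5, minSamples = 5) {
    this.proxies = urls.map((url) => ({ url, successes: 0, failures: 0 }));
    this.minSuccessRate = minSuccessRate;
    this.minSamples = minSamples;
  }

  // A proxy stays healthy until it has enough samples and a success rate below the threshold.
  private isHealthy(p: ProxyStats): boolean {
    const total = p.successes + p.failures;
    if (total < this.minSamples) return true;
    return p.successes / total >= this.minSuccessRate;
  }

  // Round-robin over healthy proxies; returns undefined when none remain.
  next(): string | undefined {
    for (let i = 0; i < this.proxies.length; i++) {
      const position = (this.cursor + i) % this.proxies.length;
      const candidate = this.proxies[position];
      if (this.isHealthy(candidate)) {
        this.cursor = (position + 1) % this.proxies.length;
        return candidate.url;
      }
    }
    return undefined;
  }

  // Records the outcome of a request made through the given proxy.
  report(url: string, ok: boolean): void {
    const proxy = this.proxies.find((p) => p.url === url);
    if (!proxy) return;
    if (ok) proxy.successes++;
    else proxy.failures++;
  }
}
```

A caller would ask next() for a proxy before each request and call report() with the outcome afterwards. Requiring a minimum number of samples before judging a proxy avoids dropping it after a single unlucky failure.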

API and Frontend

The Express server exposes a comprehensive REST API covering status monitoring, crawl control, document browsing, on-demand retrieval, search, and analytics. The frontend is a vanilla JavaScript SPA that communicates exclusively through this API, providing real-time dashboards, document browsers, court explorers, and analytics views.

Supported Document Types

Code | Name | Format | Description
SI | Strukturierte Inhalte | XML | Structured company data in XJustiz format
AD | Aktueller Abdruck | PDF | Current official register printout
CD | Chronologischer Abdruck | PDF | Complete chronological change history
HD | Historischer Abdruck | PDF | Historical document snapshot
DK | Dokumentenansicht | Tree | Document archive with navigation
UT | Unternehmensträger | XML | Entity carrier information
GL | Gesellschafterliste | PDF | Most recent shareholder list
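
For readers working with these codes in TypeScript, the table can be expressed as a typed lookup. The values are taken directly from the table above; the type names and the codesByFormat helper are hypothetical and not part of the crawler.

```typescript
// Typed lookup built from the table above; identifiers are illustrative only.
type DocumentCode = "SI" | "AD" | "CD" | "HD" | "DK" | "UT" | "GL";
type DocumentFormat = "XML" | "PDF" | "Tree";

interface DocumentTypeInfo {
  name: string;
  format: DocumentFormat;
  description: string;
}

const DOCUMENT_TYPES: Record<DocumentCode, DocumentTypeInfo> = {
  SI: { name: "Strukturierte Inhalte", format: "XML", description: "Structured company data in XJustiz format" },
  AD: { name: "Aktueller Abdruck", format: "PDF", description: "Current official register printout" },
  CD: { name: "Chronologischer Abdruck", format: "PDF", description: "Complete chronological change history" },
  HD: { name: "Historischer Abdruck", format: "PDF", description: "Historical document snapshot" },
  DK: { name: "Dokumentenansicht", format: "Tree", description: "Document archive with navigation" },
  UT: { name: "Unternehmensträger", format: "XML", description: "Entity carrier information" },
  GL: { name: "Gesellschafterliste", format: "PDF", description: "Most recent shareholder list" },
};

// Returns all document codes delivered in the given format, in table order.
function codesByFormat(format: DocumentFormat): DocumentCode[] {
  return (Object.keys(DOCUMENT_TYPES) as DocumentCode[]).filter(
    (code) => DOCUMENT_TYPES[code].format === format,
  );
}
```

Using a Record keyed by the DocumentCode union means the compiler flags a missing or misspelled code if the table changes.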

Deployment

Deployment requires only Node.js and npm. After installing dependencies and configuring the .env file with proxy settings and notification webhooks, the server starts with npm start and is accessible on the configured port (default 9399). The application is self-contained with no external database dependencies.
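
The sketch below shows how such settings might be read with the documented default port of 9399. The environment variable names (PORT, PROXY_URLS, DISCORD_WEBHOOK_URL, SLACK_WEBHOOK_URL) are assumptions, since the text does not list the project's actual .env keys.

```typescript
// Hypothetical variable names; the project's actual .env keys are not given in the text.
interface ServerConfig {
  port: number;
  proxyUrls: string[];
  discordWebhook?: string;
  slackWebhook?: string;
}

const DEFAULT_PORT = 9399; // default stated in the text

function loadConfig(env: Record<string, string | undefined>): ServerConfig {
  const parsedPort = Number.parseInt(env["PORT"] ?? "", 10);
  const portIsValid = Number.isInteger(parsedPort) && parsedPort > 0 && parsedPort < 65536;

  return {
    // Fall back to the default when PORT is missing or not a valid port number.
    port: portIsValid ? parsedPort : DEFAULT_PORT,
    // Assumed format: a comma-separated list of proxy URLs.
    proxyUrls: (env["PROXY_URLS"] ?? "")
      .split(",")
      .map((entry) => entry.trim())
      .filter((entry) => entry.length > 0),
    // Empty strings are treated as "not configured".
    discordWebhook: env["DISCORD_WEBHOOK_URL"] || undefined,
    slackWebhook: env["SLACK_WEBHOOK_URL"] || undefined,
  };
}
```

In a Node.js process this would typically be called as loadConfig(process.env) after the .env file has been loaded. Passing the environment in as a parameter keeps the function easy to test.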

Conclusion

The Vereinsregister Universal Crawler transforms what would be months of manual document retrieval into an automated, monitored, and scalable operation. With multi-court parallel processing, 7 document types including shareholder lists, intelligent proxy rotation, and a comprehensive analytics dashboard, it provides enterprise-grade access to German and Austrian business register data. For compliance teams, legal departments, and business intelligence operations, it is an indispensable tool for large-scale register data extraction.