Germany’s largest digital industry association, Bitkom e.V., publishes thousands of studies, position papers, press releases, and guides every year. Yet all this knowledge lives scattered across bitkom.org in isolated pages and PDFs, making it nearly impossible to trace how topics evolve, who the key players are, or what Bitkom’s official stance is on any given regulation. Bitkom Universe changes that by crawling, indexing, and interconnecting the entire public output of Bitkom into a unified, semantically searchable knowledge base with an interactive knowledge graph.
The Problem
Bitkom.org hosts over 7,400 URLs spanning press releases, publications, topic pages, organization pages, and media libraries. The site has no unified search that understands context, no way to trace relationships between people, organizations, topics, and regulations, and no trend analysis capability. Researchers, policy analysts, and industry professionals who need to answer questions like “What is Bitkom’s position on AI regulation?” or “Which companies sponsor the most cloud studies?” are forced to manually sift through thousands of pages.
“Show me the network around Achim Berg” is a query that should take seconds, not hours of manual research.
The Solution
Bitkom Universe is an intelligent knowledge system that combines web crawling, PDF extraction, German-language NLP, and knowledge graph construction into a single platform. It transforms Bitkom’s scattered content into an interconnected, queryable knowledge base with five core capabilities:
- Semantic Search: Natural language queries instead of keyword matching
- Knowledge Graph: 154+ interconnected entities spanning documents, people, organizations, topics, and regulations
- Trend Analysis: Track emerging topics and their lifecycle with computed “Hype Scores”
- Media Impact Tracking: Monitor where Bitkom publications get cited externally
- Timeline Visualization: View the historical evolution of topics and publications
Key Features
Interactive Knowledge Graph
A force-directed graph visualization built with Vis.js displays entities and their connections. Users can filter by entity type (Topics, People, Organizations, Laws), click nodes to explore connections, and adjust the physics simulation parameters. With 154 nodes and 2,281 edges, the graph reveals otherwise hidden relationships within Bitkom’s ecosystem of stakeholders and topics.
Hype Score and Trend Analysis
Every topic receives a computed Hype Score based on publication frequency, recency, and citation patterns. Trend indicators show whether a topic is rising, stable, or declining, enabling analysts to spot emerging themes before they become mainstream.
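The exact Hype Score formula is not given here. As a minimal sketch of how publication frequency, recency, and citations could be combined, the following assumes an exponential recency decay and a logarithmic citation term. The function names, the 180-day half-life, and the 20% trend tolerance are illustrative assumptions, not the project's actual parameters.

```python
import math
from datetime import date


def hype_score(pub_dates, citations, today, half_life_days=180.0):
    """Hypothetical score: recency-weighted publication count plus a damped citation term."""
    decay = math.log(2) / half_life_days
    # Each publication contributes 1.0 when new and 0.5 after one half-life.
    recency_weighted = sum(
        math.exp(-decay * max((today - d).days, 0)) for d in pub_dates
    )
    # log1p damps the influence of a few heavily cited documents.
    return recency_weighted + math.log1p(citations)


def trend_direction(recent_count, previous_count, tolerance=0.2):
    """Classify a topic by comparing publication counts in two time windows."""
    if previous_count == 0:
        return "rising" if recent_count > 0 else "stable"
    change = (recent_count - previous_count) / previous_count
    if change > tolerance:
        return "rising"
    if change < -tolerance:
        return "declining"
    return "stable"


# Illustrative input, not real Bitkom data.
score = hype_score([date(2024, 12, 1), date(2024, 6, 1)], citations=5, today=date(2025, 1, 1))
print(round(score, 2), trend_direction(6, 4))
```

Under these assumptions, a topic with many recent publications outranks one with the same number of older publications, which matches the stated goal of surfacing emerging themes early.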
Semantic Search Engine
The search engine processes natural language queries, recognizes entities in search terms, and returns results enriched with entity relationships. It supports document type filtering and connection discovery between entities, moving far beyond simple keyword matching.
Comprehensive Data Pipeline
- Sitemap Crawler: Discovers and crawls 7,457 URLs systematically
- PDF Extractor: Processes 2,005 PDF documents with PyMuPDF
- NER Pipeline: German Named Entity Recognition via spaCy’s de_core_news_lg model
- Graph Builder: Constructs the knowledge graph with NetworkX, establishing relationships like authored_by, sponsored_by, covers_topic, and mentioned_in
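The first pipeline step, URL discovery from a sitemap, can be sketched with the Python standard library alone. This is an illustration under assumptions: the sample XML and the example.org URLs are placeholders, the helper names are hypothetical, and downloading, rate limiting, and PDF text extraction are left out.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml):
    """Return every <loc> value found in a sitemap (or sitemap index) document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


def split_by_kind(urls):
    """Separate PDF links from HTML pages by file extension (a simple heuristic)."""
    pdfs = [u for u in urls if u.lower().split("?")[0].endswith(".pdf")]
    pdf_set = set(pdfs)
    pages = [u for u in urls if u not in pdf_set]
    return pages, pdfs


# Placeholder sitemap; the real crawler would fetch this over HTTP.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/press/release-1</loc></url>
  <url><loc>https://example.org/files/study.pdf</loc></url>
</urlset>"""

pages, pdfs = split_by_kind(extract_urls(SAMPLE_SITEMAP))
print(len(pages), "pages,", len(pdfs), "PDFs")
```

Splitting the URL list up front lets HTML pages and PDFs be routed to different extractors, which mirrors the separate crawler and PDF extractor stages listed above.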
Technology Stack
| Layer | Technology | Purpose |
|---|---|---|
| Crawling | Requests, BeautifulSoup, PyMuPDF | Web scraping + PDF extraction |
| NLP/NER | spaCy (de_core_news_lg) | German entity recognition |
| Graph | NetworkX | Knowledge graph construction |
| Frontend | HTML5, JavaScript, Chart.js, Vis.js | Interactive visualizations |
| Data | JSON (PostgreSQL/Neo4j ready) | Document + metadata storage |
Architecture
The system follows a pipeline architecture with four main stages:
- Crawl Layer: Systematically traverses Bitkom’s sitemap to discover and download all public pages and PDF documents. The crawler respects rate limits and handles the site’s complex navigation structure.
- NLP Layer: Processes every document through spaCy’s German language model to extract named entities (people, organizations, topics, laws, and locations).
- Graph Construction Layer: Takes the extracted entities and builds a knowledge graph, computing relationships between them based on co-occurrence, authorship, and topic coverage.
- Presentation Layer: Renders everything through an interactive web frontend with search, graph visualization, timeline charts, and media impact dashboards.
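To make the co-occurrence step concrete, here is a minimal sketch that counts how often two entities appear in the same document. It uses plain standard-library containers as a stand-in for the NetworkX graph; the document IDs, the entity lists, and the `min_weight` threshold are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(doc_entities, min_weight=1):
    """Map each entity pair to the number of documents mentioning both."""
    weights = Counter()
    for entities in doc_entities.values():
        # sorted(set(...)) removes duplicates and gives each pair a canonical order.
        for a, b in combinations(sorted(set(entities)), 2):
            weights[(a, b)] += 1
    return {pair: w for pair, w in weights.items() if w >= min_weight}


# Illustrative NER output per document, not real extraction results.
docs = {
    "doc-1": ["Achim Berg", "AI regulation", "Bitkom"],
    "doc-2": ["AI regulation", "Bitkom"],
    "doc-3": ["Cloud", "Bitkom", "Bitkom"],
}

for (a, b), weight in cooccurrence_edges(docs, min_weight=2).items():
    print(a, "<->", b, "weight", weight)
```

The resulting weighted pairs could then be added to a graph as edges, with typed relationships such as authorship layered on top from document metadata.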
The data models are designed around a set of core entity types, five of which are specified here:
Document → title, date, authors[], topics[], full_text, pdf_url, hype_score
Person → name, roles[], organization, documents[], influence_score
Organization → name, bitkom_member, topics[], activity_score
Topic → name, aliases[], related_topics[], trend_direction, hype_score
Law → name, aliases[], relevant_documents[]
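One way to express these models in code is with Python dataclasses, which serialize naturally to the JSON storage format listed in the technology stack. The field names follow the listing above; the field types, default values, and the `to_json` helper are assumptions made for illustration.

```python
import datetime
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Document:
    title: str
    date: Optional[datetime.date] = None
    authors: list[str] = field(default_factory=list)
    topics: list[str] = field(default_factory=list)
    full_text: str = ""
    pdf_url: Optional[str] = None
    hype_score: float = 0.0


@dataclass
class Person:
    name: str
    roles: list[str] = field(default_factory=list)
    organization: Optional[str] = None
    documents: list[str] = field(default_factory=list)
    influence_score: float = 0.0


@dataclass
class Organization:
    name: str
    bitkom_member: bool = False
    topics: list[str] = field(default_factory=list)
    activity_score: float = 0.0


@dataclass
class Topic:
    name: str
    aliases: list[str] = field(default_factory=list)
    related_topics: list[str] = field(default_factory=list)
    trend_direction: str = "stable"  # assumed values: rising, stable, declining
    hype_score: float = 0.0


@dataclass
class Law:
    name: str
    aliases: list[str] = field(default_factory=list)
    relevant_documents: list[str] = field(default_factory=list)


def to_json(entity):
    """Serialize any of the dataclasses above; dates are written as ISO strings."""
    return json.dumps(asdict(entity), default=str, ensure_ascii=False)


# Placeholder record, not a real Bitkom publication.
doc = Document(title="Example study", date=datetime.date(2024, 1, 15), topics=["Cloud"])
print(to_json(doc))
```

Keeping references between entities as plain string identifiers, as assumed here, keeps the records flat enough for JSON today while leaving room for a later move to PostgreSQL or Neo4j.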
A planned REST API will expose endpoints for document search, graph navigation, semantic queries, and media coverage tracking. An MCP (Model Context Protocol) server is also planned for integration with AI assistants like Claude.
Current Scale
| Metric | Value |
|---|---|
| URLs Identified | 7,457 |
| Documents Crawled | 2,500+ |
| PDFs Extracted | 2,005 |
| Characters Indexed | 49.4 million |
| Knowledge Graph Nodes | 154 |
| Knowledge Graph Edges | 2,281 |
| Entity Types | 10 |
Looking Ahead
Bitkom Universe shows how combining modern NLP with graph-based knowledge representation can turn scattered organizational content into an interconnected, queryable resource. The project targets search relevance (NDCG) above 0.8, entity extraction accuracy above 90%, and query response times under 3 seconds. These goals are meant to make the platform a practical research tool for anyone navigating Germany’s digital policy landscape.
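For readers unfamiliar with the NDCG target, the following sketch shows how the metric is computed for a single query, using the common linear-gain form of discounted cumulative gain. The relevance labels in the example are made up; how the project collects relevance judgments is not specified here.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each relevance is divided by log2(rank + 1)."""
    return sum(
        rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances, k=None):
    """DCG of the returned ranking divided by the DCG of the ideal ordering."""
    ranked = list(relevances)[:k] if k else list(relevances)
    ideal = sorted(relevances, reverse=True)[: len(ranked)]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


# Graded relevance of search results in the order they were returned (illustrative).
print(round(ndcg([3, 2, 3, 0, 1]), 3))
```

A value of 1.0 means the results are already in the best possible order, so a target above 0.8 requires highly relevant documents to appear near the top of most result lists.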