Architecture¶

Code Analyzer is designed as a two-part application with a shared YAML store. A first application captures and analyses the source code; a second application evaluates the results and generates reports. Both communicate exclusively through the file system and can be operated independently of each other. Within each application, data access, aggregation, visualisation, and LLM connection are encapsulated in separate layers; file analysis runs asynchronously with configurable parallelism.

At a glance¶

Two-part architecture with Code Scanner and Code Analyzer as separate Gradio applications.
YAML directory as the persistent interface between capture and evaluation.
Static structural analysis before any LLM call to reduce hallucinations.
Multi-stage LLM pipeline with role-based specialisation (four prompts per file, seven prompts system-wide).
Asynchronous file processing with semaphore-controlled parallelism.
OpenAI-compatible LLM adapter; local and cloud endpoints are interchangeable.
Helper scripts for re-analysis and YAML diagnostics complement the core system.

Architecture overview¶

The application is structured into four functional layers.

The capture layer in the Code Scanner traverses the project directory and identifies the source files to be analysed using language-specific patterns. It is complemented by a structural analysis layer that determines packages, layers, and build modules purely statically (via regex and keyword lists) and derives the architectural style from them.

The analysis layer handles the LLM-based investigations. For each file, four independent calls are made against an OpenAI-compatible endpoint, each with its own functional role and a structured JSON response. An orchestration component controls the parallelism via a semaphore.

The storage layer persists all results as YAML in a hierarchical directory structure (analysis/files/<package>/<ClassName>.yaml), complemented by a project-wide structure file and a summary with project metrics.

The evaluation layer in the Code Analyzer reads these YAMLs, aggregates them along different axes (package, module, domain, issue category, interface), and presents the result in a tab-based dashboard. A dedicated LLM component then produces a system-wide deep analysis in seven steps as well as a complete report.

Workflow and data flow¶

flowchart TB
    subgraph Sources
        SRC[Source code project<br/>Java / PHP / Python]
        LLM[OpenAI-compatible<br/>LLM endpoint]
    end

    subgraph Scanner["Code Scanner"]
        FS[FileScanner<br/>file detection]
        PSA[ProjectStructureAnalyzer<br/>layers, modules, style]
        ORCH[Orchestration<br/>async semaphore]
        CA[CodeAnalyzer<br/>4 LLM prompts per file]
        SM[StorageManager]
    end

    subgraph Storage
        YAML[("analysis/<br/>project_structure.yaml<br/>summary.yaml<br/>files/{pkg}/{Cls}.yaml")]
    end

    subgraph Analyzer["Code Analyzer"]
        ADR[AnalysisDataReader<br/>cached YAML loader]
        DA[DeepAnalyzer<br/>aggregation]
        UIH[UIHierarchyAnalyzer]
        FA[FunctionalityAnalyzer]
        LLA[LLMAnalyzer<br/>7-step deep analysis<br/>+ report generator]
        DASH[Tab dashboard]
    end

    subgraph Helpers["Helper scripts"]
        SA[storage_analyzer<br/>re-analysis]
        YF[yaml_fixer<br/>diagnostics]
    end

    subgraph Outputs
        REPORT[Markdown report]
        CSV[CSV export<br/>capabilities / issues]
    end

    SRC --> FS
    FS --> PSA
    FS --> ORCH
    PSA --> SM
    ORCH --> CA
    CA <--> LLM
    CA --> SM
    SM --> YAML

    YAML --> ADR
    ADR --> DA
    ADR --> UIH
    ADR --> FA
    DA --> DASH
    UIH --> DASH
    FA --> DASH
    DA --> LLA
    UIH --> LLA
    FA --> LLA
    LLA <--> LLM
    LLA --> REPORT
    DASH --> CSV

    YAML -.-> SA
    YAML -.-> YF
    SA -.-> CA

The workflow begins in the Code Scanner: the FileScanner identifies the relevant source files, the ProjectStructureAnalyzer builds the package tree and derives the architectural style. The orchestration controls the subsequent parallel LLM analysis: an asyncio semaphore limits the file analyses running in parallel (default: 3). For each file, the CodeAnalyzer performs four separate calls against the LLM endpoint — business logic, technical aspects, interfaces, and issues. Each analysis is written individually by the StorageManager as YAML into a package-oriented directory structure, complemented by a structure file and a summary with project metrics.

The Code Analyzer builds on this YAML inventory. The AnalysisDataReader loads and caches the data; specialised analysers (DeepAnalyzer, UIHierarchyAnalyzer, FunctionalityAnalyzer) form aggregations depending on the view (package, domain, interface, issue category). The results are presented in several tabs of the Gradio dashboard, including a matplotlib-based visualisation of the API hierarchy. The LLMAnalyzer uses the aggregates to produce a system-wide deep analysis in seven specialised calls as well as a complete Markdown report.

Multi-stage LLM pipeline¶

The LLM connection follows a role- and stage-based pattern. At the file level, four separate prompts are executed, each with its own role: Senior Software Architect for business logic, technical aspects, and interfaces; Senior Security Engineer for issues. All prompts enforce a pure JSON response and use a low temperature (0.2) and a moderate token budget (1500). At the system level, seven further calls are made against aggregated data, which together form the deep analysis (overview, architecture, business domains, interfaces, quality, modernisation, executive summary). A final call with an increased token budget (8000) produces the complete report in nine sections. This separation into specialised, focused calls is a central characteristic of the architecture — it replaces a single general prompt with a controlled sequence with clear responsibilities.

Concurrency and robustness¶

File analysis runs asynchronously via asyncio.gather with a semaphore that limits the parallel LLM calls. Errors in individual files are caught and marked as status: error in the respective YAML, without aborting the overall analysis. A validation view in the Code Analyzer checks the YAML inventory for completeness and consistency; for missing analyses, the CLI tool storage_analyzer.py can fill in the gaps in a targeted manner. The second CLI tool yaml_fixer.py diagnoses discrepancies between the summary and the file inventory. Source code is truncated to a configured maximum length before being passed to the LLM, in order to respect token limits.

Configuration and deployment¶

Both applications are started as stand-alone Python processes (Code Scanner on port 7860, Code Analyzer on port 7861) and provide a Gradio web interface. Configuration is handled via environment variables (LLM endpoint, model name, API key) and the UI itself (project path, language, parallelism). Python 3.9 or higher is required; a requirements.txt lists the direct dependencies.

Technology overview¶

Language and runtime — Python 3.9 or higher; asynchronous processing with asyncio and aiofiles.
Web interface — Gradio (≥ 4.0) for both applications.
LLM connection — OpenAI Python SDK (≥ 1.0) against any OpenAI-compatible endpoint (OpenAI API, Ollama, vLLM, LM Studio).
Data storage — YAML as the intermediate and exchange format, read and written via PyYAML.
Aggregation and tables — pandas for the data frames in the dashboard.
Visualisation — matplotlib (Agg backend) and Pillow for the hierarchy renderings.
Table formatting — tabulate for Markdown output.
Deployment — stand-alone Python processes; no containerisation setup is provided.