From specification to code: Methodological insights from the LLM-supported development of a text style editor#
Part of a series on methodological insights from LLM-supported development projects
Introduction#
Over several months, various LLM-supported tools were developed, primarily as learning projects to explore the possibilities and limitations of LLM-supported coding. This article documents one of these experiments: the development of a text style editor. The focus is not on the tool itself, but on the methodological insights gained from the development process.
The experiment had a specific precursor: a profile-based style transformation tool that extracted style features from reference texts and applied them to new texts. Experience with this predecessor led to the question of whether an alternative approach – setting style parameters directly instead of extracting them – could deliver more practical results.
The tool: concept and functionality#
Basic idea#
The text style editor takes a rule-based approach to text transformation. Instead of extracting and copying styles from reference texts, users set the desired style features directly using sliders. The concept is based on the idea that many use cases do not require an exact copy of the style, but rather a targeted adjustment of individual style parameters.
Two-step process#
The transformation takes place in two optional steps:
Step 1 – Neutralisation: The input text is freed from stylistic peculiarities. There are ten dimensions to choose from: tonality, register, word choice, redundancy, subjectivity, perspective, intensifiers, structure, imagery and cultural references. The dimensions are categorised as "safe" and "experimental". The distinction is based on the assumption that overly aggressive neutralisation could distort the meaning of the text. The experimental dimensions (structure, imagery, cultural references) interfere more strongly with the text content and are therefore treated with caution.
Step 2 – Stylisation: The desired stylistic features are set using 34 controls in seven categories. The controls cover the following areas:
- Tone and emotion (emotional, friendly, optimistic, cheerful, affectionate, dramatic)
- Formality and politeness (formal, polite, direct, personal, confident)
- Clarity and comprehensibility (simple, precise, clear, factual, confusing, hedging, nominal, abstract)
- Creativity and stylistic devices (poetic, magical, ironic, sarcastic, funny, provocative, nostalgic)
- Persuasion and rhetoric (persuasive, motivating)
- Format and structure (length, structured, active)
- Target audience and level (language level, target audience age, technical language)
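The two-step flow described above can be sketched in a few lines. This is not the tool's actual code, just a minimal illustration of the orchestration, assuming an injected `llm` callable that wraps the OpenAI-compatible endpoint; both steps remain optional:

```python
from typing import Callable

def transform(
    text: str,
    neutralise_dims: list[str],
    style_instructions: list[str],
    llm: Callable[[str], str],
) -> str:
    """Two optional steps: neutralise first, then stylise.

    `llm` is any callable that takes a prompt and returns the rewritten
    text (e.g. a thin wrapper around an OpenAI-compatible chat endpoint).
    """
    if neutralise_dims:
        # Step 1: strip stylistic peculiarities in the selected dimensions.
        prompt = (
            "Rewrite the text, removing stylistic peculiarities in these "
            "dimensions: " + ", ".join(neutralise_dims) + "\n\n" + text
        )
        text = llm(prompt)
    if style_instructions:
        # Step 2: apply the desired style features.
        prompt = (
            "Rewrite the text with the following stylistic instructions:\n- "
            + "\n- ".join(style_instructions) + "\n\n" + text
        )
        text = llm(prompt)
    return text
```

Passing an empty dimension list skips neutralisation entirely, which matches the "two optional steps" design.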
Control types#
The system distinguishes between three types of controls:
Polar controls span a spectrum between two opposites, such as "formal" to "informal" or "emotional" to "sober". The value range runs from -10 to +10, with negative values pulling towards one pole and positive values towards the other.
Intensity controls set the strength of a single characteristic from 0 (off) to 10 (maximum). Examples include irony, poetry or objectivity.
Level controls offer discrete options. The language level, for example, can be set to A1, A2, B1, B2, C1 or C2, and the target audience to different age groups.
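A plausible data model for these three control types, sketched as Python dataclasses. The class and field names here are assumptions for illustration, not the actual contents of models.py:

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass(frozen=True)
class PolarControl:
    """Spectrum between two opposites, e.g. formal vs. informal (-10..+10)."""
    name: str
    negative_pole: str
    positive_pole: str

    def clamp(self, value: int) -> int:
        return max(-10, min(10, value))

@dataclass(frozen=True)
class IntensityControl:
    """Single characteristic from 0 (off) to 10 (maximum), e.g. irony."""
    name: str

    def clamp(self, value: int) -> int:
        return max(0, min(10, value))

@dataclass(frozen=True)
class LevelControl:
    """Discrete options, e.g. CEFR levels for the language level."""
    name: str
    options: Sequence[str]

    def validate(self, value: str) -> str:
        if value not in self.options:
            raise ValueError(f"{value!r} is not one of {list(self.options)}")
        return value
```

Keeping the three types as separate classes lets the prompt builder dispatch on the control type rather than interpret a raw number.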
Intensity levels#
A central element is the system of intensity levels, which determine how strongly a stylistic feature is implemented. The gradation ranges from "light" (values 1-2: "add a touch") through "moderate" (3-4), "distinct" (5-6) and "strong" (7-8) to "extreme" (9-10: "maximally exaggerated, taken to the extreme"). This gradation was not part of the initial specification, but arose during testing when it became apparent that the style features were not being implemented clearly enough without explicit intensity specifications.
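The gradation can be represented as a simple lookup table. The wording for the "light" and "extreme" levels is quoted from the article; the instructions for the middle levels are placeholders invented here for illustration:

```python
# (value range, label, instruction for the executing LLM)
# Middle-level instructions are illustrative placeholders, not the tool's wording.
INTENSITY_LEVELS = [
    ((1, 2), "light", "add a touch of it"),
    ((3, 4), "moderate", "apply it noticeably"),
    ((5, 6), "distinct", "apply it clearly throughout"),
    ((7, 8), "strong", "apply it strongly and consistently"),
    ((9, 10), "extreme", "maximally exaggerated, taken to the extreme"),
]

def intensity_label(value: int) -> tuple[str, str]:
    """Return (label, instruction) for a slider value from 1 to 10."""
    for (low, high), label, instruction in INTENSITY_LEVELS:
        if low <= value <= high:
            return label, instruction
    raise ValueError(f"value must be between 1 and 10, got {value}")
```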
Presets and persistence#
For a quick start, 23 presets are available that provide predefined control combinations for typical use cases – from "academic work" to "business email" to "fairy tales" or "cynical commentary". In addition, settings can be exported as hash codes and restored later. This feature was added once it became apparent that the tool worked well and successful settings were worth persisting.
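The article does not specify how the "hash codes" are constructed. One plausible reading is a compact, reversible encoding of the settings rather than a cryptographic hash; a minimal sketch using base64-encoded JSON:

```python
import base64
import json

def export_settings(settings: dict) -> str:
    """Serialise slider settings into a compact, shareable code."""
    raw = json.dumps(settings, sort_keys=True, separators=(",", ":"))
    return base64.urlsafe_b64encode(raw.encode("utf-8")).decode("ascii")

def import_settings(code: str) -> dict:
    """Restore slider settings from an exported code."""
    raw = base64.urlsafe_b64decode(code.encode("ascii"))
    return json.loads(raw.decode("utf-8"))
```

A real implementation might add a version field and a checksum so that stale or mistyped codes fail loudly instead of restoring garbage settings.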
The development process#
Initial situation and motivation#
The experiment arose from experience with a previous project. This had taken a profile-based approach: stylistic features were extracted from reference texts (such as texts by Goethe, Steve Jobs or Edgar Wallace) and stored as JSON profiles. These profiles contained detailed linguistic metrics such as average sentence length (in words), sentence length variance, complexity (subordinate clauses per sentence, nesting depth), passive voice ratio, nominal style score, hedging level, connector frequencies (causal, adversative, temporal), foreign word ratio, degree of abstraction and degree of formality.
The profile for Goethe, for example, contained information such as "28.4 words average sentence length", "complexity: high", "passive proportion: 13%", "nominal style: 68%", "hedging level: medium" and detailed connector distributions (temporal 42%, causal 28%, adversative 18%, conditional 12%). In addition, representative example sentences and a generated transformation prompt were saved.
The approach worked technically, but had practical limitations: the extracted profiles were complex and difficult for users to understand. Users could see which metrics a profile contained, but had no direct control over them. The style analysis was a black box – the result had to be accepted as it was. The insight was that users often want more direct control over individual parameters. The new tool should therefore take a different approach: synthesis instead of analysis, direct parameter control instead of profile copying.
Another aspect was the complexity of its predecessor: ~2,250 lines of code in a deeper module hierarchy, with separate components for style analysis, fine-tuning through before-and-after comparisons, and profile visualisation. This complexity was possibly oversized for the use case.
Specification phase#
Development followed a two-phase approach, which proved to be methodologically crucial. The first phase consisted of a detailed specification in dialogue with various LLMs.
The specification comprised approximately 1,400 lines and 50,000 characters – almost as extensive as the resulting code itself. In this phase, the following aspects were developed in dialogue with the LLM:
- Project objectives and scope
- Functional requirements (two-stage process, control types, presets)
- UI concept (grouped controls, accordions, token counters)
- Technical architecture (module structure, data models, API integration)
- Configuration concept (environment variables, JSON files)
The specification was not a static document created in advance, but developed in dialogue. The LLM acted as a discussion partner, questioning concepts, suggesting alternatives and helping to clarify details.
Implementation phase#
Once the specification was complete, the code was generated in a single pass. The clear, comprehensive specification made this possible without any major correction loops. The result comprised approximately 2,000 lines of Python code and 500 lines of JSON configuration, spread across 12 files.
The architecture was deliberately kept flat: six Python modules (app.py, config.py, llm_client.py, models.py, prompt_builder.py, token_counter.py) instead of a nested structure. This decision was a direct consequence of experience with the more complex predecessor, which had around 2,250 lines of code in a deeper module hierarchy.
Iterations#
The initial generation was followed by three main iterations:
Iteration 1 – Intensity adjustment: Initial tests showed that the style features were not implemented clearly enough. The problem was analysed in dialogue with the LLM and a solution was developed: explicit intensity levels with clear instructions to the executing LLM ("add a touch" to "maximum exaggeration").
Iteration 2 – History and hash code: Once it became apparent that the tool was working well, the desire arose to persist successful settings. In one pass, a history function and the export/import of settings as a hash code were added.
The entire development took about two hours: one hour for the specification, the rest for implementation, deployment and enhancements. The project was carried out on the side in one day.
Methodological findings#
Specification as a key factor#
The key insight from this experiment was that the quality of the specification determines the quality of the generated code. The ratio of specification effort to implementation effort shifted significantly compared to previous projects. The actual development work increasingly took place in the design phase â code generation became the execution step of an already well-thought-out solution.
A direct comparison illustrates this: if the development process had started with a vague requirement such as "build me a tool for style transformation", several iteration loops would probably have been necessary to clarify conceptual questions that had already been answered in the specification phase.
Dialogue as a specification method#
The specification was not created as an isolated document, but in dialogue with the LLM. This method proved to be productive: the LLM helped to clarify requirements, suggested alternatives and identified gaps in the concept. The resulting specification was more comprehensive and consistent than a specification written without this dialogue would likely have been.
Problem solving in later iterations also followed this pattern. In the case of the intensity levels, for example, the developer understood the problem (the styles were not being implemented clearly enough), but the technical solution was developed in collaboration with the LLM.
Learning transfer between projects#
The experiment shows how insights from previous projects are incorporated into new ones. The flat architecture was a conscious response to the experience with the more complex predecessor. The focus on synthesis rather than analysis addressed the realisation that users want more direct control.
Active countermeasures against overengineering#
Despite the conceptually simpler approach (only synthesis, no analysis), the new tool achieved a similar code size to its predecessor. This suggests that active countermeasures against overengineering by the LLM were necessary. LLMs tend to add abstractions and structures that are not necessary for the specific use case.
This tendency towards overengineering manifests itself in various forms: additional abstraction layers that do not fulfil a clear function; generic interfaces that only have one concrete implementation; complex configuration mechanisms for functions that are unlikely ever to be changed. Countermeasures require a clear idea of what is actually needed â another reason why a precise specification is so important.
Prompt engineering in the tool itself#
An interesting aspect of this project is that the tool itself performs prompt engineering: the prompt builder generates prompts for the executing LLM based on the control settings. The quality of these generated prompts determines the quality of the text transformation.
The intensity levels are an example of iteratively improved prompt engineering. The initial version used abstract intensity specifications that the executing LLM did not implement consistently. The revised version contains concrete instructions: "Add a hint of it" for low intensity, "Apply this to the MAXIMUM and exaggerate it! Take it to the extreme!" for the highest intensity.
The neutralisation prompts are formulated with similar precision. The dimension "tonality", for example, is translated as: "Remove all emotional colouring such as irony, sarcasm, anger, joy or sadness. The text should be emotionally neutral." This specificity reduces room for interpretation and leads to more consistent results.
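A dimension-to-instruction table of this kind is presumably at the heart of the neutralisation prompt builder. Only the "tonality" wording is quoted from the article; the other entries and the function shape are assumptions for illustration:

```python
# Maps neutralisation dimensions to explicit instructions.
# "tonality" is quoted from the article; the other entries are invented examples.
NEUTRALISATION_INSTRUCTIONS = {
    "tonality": (
        "Remove all emotional colouring such as irony, sarcasm, anger, "
        "joy or sadness. The text should be emotionally neutral."
    ),
    "subjectivity": "Remove personal opinions and evaluative wording.",
    "intensifiers": "Remove intensifying adverbs such as 'very' or 'extremely'.",
}

def build_neutralisation_prompt(dimensions: list[str]) -> str:
    """Assemble the instruction block for the selected dimensions."""
    lines = [NEUTRALISATION_INSTRUCTIONS[d] for d in dimensions]
    return "Neutralise the following text:\n" + "\n".join(lines)
```

Because each dimension maps to one concrete, self-contained instruction, adding a dimension to the prompt never changes the wording of the others, which keeps the results reproducible.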
Evolution of the methodology#
With increasing experience, the specifications become longer and more precise. The ratio of specification to code in this project (1,400 lines / 50,000 characters for 2,000 lines of code) is higher than in previous experiments. This development reflects the realisation that precision in the specification saves time during implementation.
The evolution can also be described in qualitative terms: earlier specifications primarily described the "what" (which functions the tool should have), while later specifications increasingly also describe the "how" (technical architecture, data models, UI structure) and the "why not" (deliberately excluded features, avoided complexity). This completeness reduces room for interpretation and prevents the LLM from making its own assumptions that would have to be corrected later.
Current status and outlook#
The tool is currently being evaluated by several people with different perspectives. This evaluation is deliberately broad in scope: technically savvy users are testing the limits of the control combinations, while less technical users are evaluating usability and comprehensibility. Initial findings show that the tool also works well for finer style adjustments – not just for drastic transformations such as "turn this text into a fairy tale".
The two-stage architecture (neutralisation → stylisation) seems to help achieve the desired styles more clearly. The hypothesis behind this is that if the source text already has strong stylistic characteristics, the executing LLM may face a conflict between the original style and the desired target style. Upstream neutralisation reduces this conflict. The evaluation will show whether this hypothesis holds empirically.
A concrete example: an emotionally charged complaint text is to be transformed into a factual report. Without neutralisation, the original emotionality often "shines through". With upstream neutralisation (dimensions: tonality, subjectivity, intensifiers), the text is first reduced to its informational content before being stylised into a factual report. The results are more consistent.
Potential further development#
For hypothetical productive use, simplification of the UI would be necessary. Although 34 controls offer fine control, they overwhelm average users. Several approaches are conceivable:
Simplified modes: A "simple" mode could display only the 5-8 most important controls and hide the details behind an "advanced" switch.
Intelligent presets: Instead of static presets, context-sensitive suggestions could be generated based on the input text ("This text seems very informal – would you like to make it more formal?").
Target style description: Users could describe the desired style in natural language ("professional, but not stiff"), and the system would derive the appropriate control settings.
However, these enhancements would themselves be the subject of further experimentation – whether they actually lead to better usability would need to be evaluated.
Conclusion#
The experiment confirms and expands on findings from previous projects. The main methodological finding – specification as a key factor – is not surprising, but its quantitative significance (a specification almost as extensive as the code) deserves attention. It suggests that the role of the developer in LLM-supported coding is shifting: from implementation to conception, from programming to precise requirements definition.
The separation into specification and implementation phases has proven to be productive. It allows conceptual issues to be clarified before code is generated, thus reducing time-consuming correction loops. Dialogue with the LLM is not only useful for implementation, but also for the development of the specification itself.
Metrics#
| Category | Value |
|---|---|
| Code size | ~2,000 lines of Python |
| Configuration | ~500 lines of JSON |
| Files | 12 |
| Specification | ~1,400 lines, ~50,000 characters |
| Total time | ~2 hours |
| Time distribution | ~1h specification, rest implementation/deployment/extensions |
| Sessions | 1 day |
| Main iterations | 3 |
| Spec:code ratio | ~1:1.5 (based on characters) |
| Controls | 34 in 7 categories |
| Presets | 23 |
| Neutralisation dimensions | 10 (7 safe, 3 experimental) |
Technical overview#
Stack: Python, Gradio 6, OpenAI-compatible API (vLLM), tiktoken
Modules:
- app.py – Main application and UI
- config.py – Configuration management
- llm_client.py – API communication with retry logic
- models.py – Data models (controls, settings, results)
- prompt_builder.py – Generation of LLM prompts
- token_counter.py – Token counting
Deployment: Docker with docker-compose, configurable via environment variables