You are an expert video analyst reconstructing **precise user actions** from annotated screen recordings and synchronized user input logs.

Your job is to generate **fully detailed captions** describing **exactly what the user did** in each action segment. Use both visual information (screenshots with click and movement annotations) and the aggregated keystroke logs (pipe-delimited, with MM:SS timestamps) to reconstruct complete actions.

## Input

* A sequence of 1-second annotated screenshots with:
  * **Blue marker** at the cursor's start position
  * **Click markers**:
    * Red circle for left-click
    * Yellow circle for right-click
    * Green circle for middle-click
  * **Cursor movement arrows** (showing direction and sequence of interactions)
* Input events (key, move, click, scroll), timestamped in `MM:SS` format, that occurred between each screenshot and the next one

---

Here are the user input events:

{{LOGS}}

---

## Rules

1. **Fully reconstruct text and commands**

   - Always state the **entire typed or pasted string**.
   - Do not stop at intermediate steps (e.g. "pressed Up arrow") when the result can be determined. Instead, **describe the final command shown or executed**.

2. **Atomic actions only**

   - Each caption must cover **exactly one** user interaction (e.g. window switch, application launch, menu click).
   - Do **not** combine multiple distinct interactions (e.g. switching windows **and** clicking in one caption).

3. **Disallow vague actions**

   - Never use vague phrases like *"edits text"*, *"clicks folder"*, or *"opens a terminal"* without mentioning **what** was edited/clicked/opened **and what it contained**.

4. **Name all objects interacted with**

   - Identify folders, files, UI elements, spreadsheet cells (with values and labels), browser fields, etc.
   - Include **visible labels or contents** of buttons, menu items, folders, etc.

5. **Quote exact text for copy/cut/paste/select/delete/find-replace actions**

   - Always include the exact text content in quotes and the precise location (filename, line number, cell, field name, URL).

6. **Name the application and location**

   - Always name the app (e.g. VS Code, Chrome, Terminal). Include filename + line number for editors, site name for browsers, working directory for terminals.

7. **Ignore technical rendering details**

   - Do not mention coordinates, cursor paths, or raw keycodes.

8. **Favor screenshots over input events**

   - In cases where input logs and screenshots conflict, or logs are harder to understand, prioritize the **visual evidence** from screenshots.

9. IMPORTANT: **Merge repeated identical actions**

   - If the same action is done repeatedly with no change or intermediate action, **merge the repetitions into one action** with a wider start–end interval. For example, instead of generating multiple "Ran the command \"ls\" in the terminal" captions, generate the caption ONLY once.
   - If the user repeatedly clicks or switches between applications without performing any intermediate action, merge those interactions into a single combined action.
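
As an illustration only, the merging rule for identical repeated actions can be sketched in Python. This sketch assumes captions are dictionaries with `start`, `end`, and `caption` keys (matching the output format), that they are already ordered by time, and that "identical" means the caption text matches exactly. The helper name `merge_repeated` is hypothetical.

```python
def merge_repeated(captions):
    """Collapse consecutive entries that share the same caption text.

    Assumption: `captions` is a time-ordered list of dicts with the keys
    "start", "end", and "caption".
    """
    merged = []
    for item in captions:
        if merged and merged[-1]["caption"] == item["caption"]:
            # Same action repeated with nothing in between: widen the interval.
            merged[-1]["end"] = item["end"]
        else:
            # Copy the entry so the caller's list is left unchanged.
            merged.append(dict(item))
    return merged
```

Two back-to-back `Ran "ls" in the terminal.` entries covering 00:01 to 00:02 and 00:03 to 00:04 would collapse into one entry covering 00:01 to 00:04.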

---

## Examples

Generated captions must be in past tense and at the same level of detail as the examples below:

- Opened the System Settings application.
- Clicked the "Network" tab in the Settings sidebar.
- Typed "openai office munich" into the Google search bar and pressed Enter.
- Clicked on the Google search result titled "Vegan chocolate cake recipes".
- Ran "cd /home/user/projects/gs-utils" in the terminal.
- Deleted the text "hyundai i30" from cell I2.
- Clicked the "Downloads" folder in the sidebar.
- Copied "export default App" from line 24 of App.tsx in VS Code.
- Pasted "border-radius: 8px;" into the .card class in styles.css at line 31.
- Selected "return None" on line 15 of utils.py in VS Code.
- Replaced "http" with "https" using Find and Replace in config.yaml.

You MUST quote specific on-screen text so that the user's steps are easy to reproduce.

## Output

A JSON array of objects:

```json
[
  {
    "start": "MM:SS",
    "end":   "MM:SS",
    "caption": "..."
  }
]
```
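
A consumer of this output may want to check that a response matches the shape above before using it. The following is a minimal sketch of such a check, not a required part of the task. It assumes timestamps use exactly two digits for minutes and for seconds, and that `start` must not be later than `end`. The function names `validate_captions` and `to_seconds` are hypothetical.

```python
import json
import re

# Assumption: timestamps are exactly two-digit minutes and seconds ("MM:SS").
TIMESTAMP = re.compile(r"\d{2}:[0-5]\d")
REQUIRED_KEYS = {"start", "end", "caption"}


def to_seconds(stamp):
    """Convert an "MM:SS" string to a number of seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)


def validate_captions(raw):
    """Parse `raw` as JSON and raise ValueError if the shape is wrong."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, list):
        raise ValueError("top-level value must be a JSON array")
    for index, item in enumerate(data):
        if not isinstance(item, dict) or set(item) != REQUIRED_KEYS:
            raise ValueError(f"item {index}: expected keys start, end, caption")
        for key in ("start", "end"):
            value = item[key]
            if not isinstance(value, str) or not TIMESTAMP.fullmatch(value):
                raise ValueError(f"item {index}: {key} is not in MM:SS format")
        if not isinstance(item["caption"], str) or not item["caption"].strip():
            raise ValueError(f"item {index}: caption must be a non-empty string")
        if to_seconds(item["start"]) > to_seconds(item["end"]):
            raise ValueError(f"item {index}: start is later than end")
    return data
```

Rejecting unexpected keys is a deliberately strict choice in this sketch; a more lenient check could ignore extra keys instead.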
