
Vision RPA Training

In this article, you will learn the basics of using Vision RPA at Wrk.

Written by Wrk Product
Updated over a week ago

Video tutorial of using Vision RPA to log into Notion.


Mental model

Vision RPA Wrk actions let you interact with a live automation session (Web Process Automation or Desktop Automation) by finding UI targets visually using one of three strategies:

  • Element text: OCR finds exact text on screen and targets it

  • Reference image: image matching finds a screenshot snippet you provide

  • Element description (AI): model interprets your natural-language description of the target

Everything in this suite depends on you having a valid session ID and (for debugging) a live view link.
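
Wrk actions are configured in the Wrkflow builder rather than called from code, but as a mental model the three strategies map roughly onto the hypothetical Python sketch below. Every name in it (the click_element function, its parameters, the file name) is invented for illustration and is not a real Wrk identifier:

```python
# Hypothetical sketch only -- Vision RPA is configured in the Wrkflow
# builder, not via a Python API. All names below are invented.

def click_element(session_id: str, **target) -> None:
    """Stand-in for the 'Click on an element' Wrk action."""
    ...

session_id = "abc123"  # from "Open a Browser Session"

# 1. Element text: OCR finds this exact string on screen and targets it.
click_element(session_id, element_text="Log in", match_index=1)

# 2. Reference image: image matching locates a screenshot snippet you provide.
click_element(session_id, reference_image="login_button.png")

# 3. Element description (AI): a model interprets your natural-language description.
click_element(session_id, element_description="blue Log in button, top right")
```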


Wrkflow pattern: build faster using “Test” + reusable session IDs

The fastest build loop shown in the video is:

  1. Test “Open a Browser Session”

  2. Copy the session ID (valid for ~10 minutes in the demo)

  3. Repeatedly Test Vision RPA Wrk actions against that same session ID

Why this matters:

  • You avoid rerunning an entire Wrkflow just to validate one click/field match.

  • You can iterate on match strategies (text vs image vs description) quickly.

Practical tip:

  • Keep the session ID handy (clipboard/notes) because you’ll reuse it constantly while tuning selectors and thresholds.
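
As a rough sketch of that loop (reusing the invented click-style names from the sketch above; open_browser_session and check_element_visible are likewise hypothetical), the point is that the session is opened once and every probe reuses its ID:

```python
# Hypothetical sketch of the "Test" loop -- in practice this is the Test
# button in the Wrkflow builder, not code. All names are invented.

def open_browser_session() -> str:
    """Stand-in for the 'Open a Browser Session' Wrk action."""
    return "abc123"

def check_element_visible(session_id: str, **target) -> bool:
    """Stand-in for the 'Check if an element is visible' Wrk action."""
    return True

session_id = open_browser_session()   # step 1: test it once
print(session_id)                     # step 2: copy it (valid ~10 min in the demo)

# Step 3: iterate on targeting strategies against the SAME session.
# Visibility checks are non-destructive, so they are safe to repeat.
for strategy in (
    {"element_text": "Log in"},
    {"reference_image": "login_button.png"},
    {"element_description": "the Log in button"},
):
    print(strategy, check_element_visible(session_id, **strategy))
```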


The 5 Vision RPA Wrk actions, when to use each

1) Click on an element using Vision RPA

Use when you need a reliable click on something visible.

Targeting options:

  • Element text (best when stable text exists)

  • Reference image (best when text is unreliable or not present)

  • Element description (AI) (fallback, see stability notes below)

Note: Match index

  • If the same text appears multiple times, Vision RPA orders matches in reading order (top-left to bottom-right), and you choose which occurrence to click.


2) Click and fill in an element using Vision RPA

Use when you want a single action to:

  1. Identify the field

  2. Click it

  3. Type a value

The video demonstrates this with a one-time code input.

Targeting options are the same as Click:

  • Element text

  • Reference image

  • Element description (AI)

Tip:

  • This is best when field focus behaviour is flaky: reducing the number of steps avoids “typed into the wrong place” failures.


3) Check if an element is visible using Vision RPA

Use for branching and resilience:

  • popups that sometimes appear

  • A/B test variants

  • conditional screens post-login

Output is effectively boolean (found / not found), so you can:

  • gate your next step (“if visible, click X else click Y”)

  • implement simple wait/retry logic without forcing a click
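
A minimal sketch of that branching and wait/retry logic, again with invented function names standing in for the Wrk actions (check_element_visible and click_element are the same hypothetical stubs as in the earlier sketches):

```python
# Hypothetical sketch -- check_element_visible and click_element stand in
# for the corresponding Wrk actions; neither is a real API.
import time

def wait_for(session_id: str, timeout_s: float = 30, poll_s: float = 2, **target) -> bool:
    """Poll the visibility check until the target appears or we time out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check_element_visible(session_id, **target):
            return True
        time.sleep(poll_s)
    return False

# Gate the next step on a popup that only sometimes appears.
if wait_for(session_id, timeout_s=10, element_text="Accept cookies"):
    click_element(session_id, element_text="Accept cookies")

# Either way, continue down the main path.
click_element(session_id, element_text="Log in")
```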


4) Retrieve text elements using Vision RPA

Use to read the screen at scale without interacting with it.

What it does:

  • Takes a screenshot of the session

  • Runs OCR to extract all matching occurrences of a given word/phrase

Outputs include:

  • match indexes (aligned with the match indexing used by element text targeting)

  • bounding boxes + center points (useful for debugging and validation)
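
The exact output format isn't shown in the article, so the field names below are guesses for illustration; the useful part is the relationship between bounding boxes and center points:

```python
# Hypothetical output shape -- field names are invented for illustration.
matches = [
    {"match_index": 1, "bounding_box": {"x": 120, "y": 80, "width": 64, "height": 18}},
    {"match_index": 2, "bounding_box": {"x": 120, "y": 410, "width": 64, "height": 18}},
]

for m in matches:
    box = m["bounding_box"]
    # The center of the bounding box is the point a Click action with the
    # same match index would target.
    center = (box["x"] + box["width"] / 2, box["y"] + box["height"] / 2)
    print(m["match_index"], center)   # 1 (152.0, 89.0) / 2 (152.0, 419.0)
```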

Good use cases:

  • verifying you’re on the right page/state

  • finding multiple data points to drive decisions

  • diagnosing why match index “1” isn’t the element you thought it was

Note:

  • Avoid fanning out multi-threaded interactions based on multiple OCR matches: UI automation is sequential by nature and can become nondeterministic.


5) Retrieve element from a screenshot using Vision RPA

This is a preflight / offline testing action.

Use when you have a screenshot and want to experiment, without spinning up a live session, with:

  • element text matching

  • reference image matching

  • element description targeting

Why it matters:

  • Faster iteration on matching strategy

  • Useful when session creation is expensive, unavailable, or rate-limited
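
A sketch of the preflight idea, with an invented function name; the input is a static file instead of a session ID, and the strategy dicts mirror the three targeting options:

```python
# Hypothetical sketch -- retrieve_element_from_screenshot is an invented
# stand-in for the Wrk action; it takes a file, not a live session.

def retrieve_element_from_screenshot(screenshot: str, **target) -> list:
    """Stand-in: run the chosen matching strategy against a static image."""
    return []

# Iterate on a strategy offline, then move the winner into the live Wrkflow.
for strategy in (
    {"element_text": "Place order"},
    {"reference_image": "place_order_button.png"},
    {"element_description": "the green checkout button"},
):
    print(strategy, retrieve_element_from_screenshot("checkout_page.png", **strategy))
```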


Stability hierarchy: what to prefer and why

For reliable automations, the video establishes a clear preference order:

  1. Element text (most stable when UI text is consistent)

  2. Reference image (stable when UI is visually consistent; can be very strong for buttons/icons)

  3. Element description (AI) (powerful, but a “black box”: results can vary as the UI shifts)

Practical guidance:

  • Use element description when:

    • the same element looks different across states

    • you can’t capture a stable reference image

  • Otherwise default to element text / reference image for repeatability.


Debugging: how to read test results

The Click (and related) actions surface high-signal debugging info:

  • Search image: what the system detected on screen (the match region)

  • Match percentage: confidence of image match (for reference image flows)

  • Clicked element image: what it actually clicked (validate correctness)

  • Change detection: what changed after the click (used to confirm success)

  • Coordinates: reusable if you need absolute positioning later

Key idea:

  • The action clicks the center of the detected region (important if your reference image includes padding or whitespace).


Advanced controls you’ll likely need in production automations

These are the knobs the video calls out as your “make it reliable” toolkit:

Match tuning (reference image)

If match percentage is lower than expected or no match is found:

  • adjust the advanced matching settings (covered in a follow-up video)

Click offset (X/Y pixels)

Use when the target region is detectable, but you need to click:

  • slightly right/left of the match

  • slightly above/below the match

Common example: you match a label but need to click inside the input next to it.
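
A worked example with made-up numbers: suppose the matched label centers at (412, 300) and the input sits about 180 px to its right.

```python
# Made-up coordinates to illustrate a click offset.
label_center = (412, 300)   # center of the matched "Email" label
offset = (180, 0)           # X/Y pixel offset configured on the action

click_point = (label_center[0] + offset[0], label_center[1] + offset[1])
print(click_point)          # (592, 300) -- inside the input, not on the label
```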

Wait for match / element to appear

Increase wait time when:

  • pages load slowly

  • post-login transitions take seconds or minutes

Goal: avoid premature failure while the UI is still rendering.

Search section (reduce false positives + speed up)

Limit scanning to:

  • the right half, left half, or a specific screen region (depending on what the UI supports)

Use when:

  • there are multiple similar matches

  • the page is visually “busy”

  • you know where the element always appears

Change detection threshold + section

Use when clicks trigger slow or large UI transitions:

  • set expected change size (none/small/medium/large)

  • increase “wait for change” time if the UI updates slowly

  • ignore animated banners/ads by narrowing the change-detection region
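
Pulling the knobs together as one hedged sketch: each keyword below mirrors a control described above, but every parameter name is invented, not a real Wrk setting (click_element is the same hypothetical stub as earlier):

```python
# Hypothetical "production hardening" settings on a single click action.
# Every parameter name is invented to mirror the knobs described above.
click_element(
    session_id,
    reference_image="submit_button.png",
    match_threshold=0.85,            # match tuning: raise/lower as needed
    click_offset=(0, -10),           # X/Y pixels relative to the match center
    wait_for_match_s=60,             # tolerate slow loads and transitions
    search_section="right_half",     # fewer false positives, faster scans
    change_detection={               # confirm the click actually did something
        "expected_size": "large",
        "wait_s": 30,
        "section": "main_content",   # ignore animated banners/ads elsewhere
    },
)
```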


Example architecture shown in the video (Notion login)

The demo Wrkflow is a good reference pattern for real projects (a pseudocode sketch follows the list):

  • Open browser session

  • Navigate to a site

  • Click login (element text + match index)

  • Enter email (typed input)

  • Click continue (reference image)

  • Fetch one-time code using Gmail actions (search email -> read content)

  • Enter code (click + fill using element description)

  • Continue and confirm logged-in state

  • Use Retrieve Text Elements to validate screen content post-login
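
As promised above, here is the whole demo flow as one hypothetical pseudocode sketch. Every function name, file name, and string is invented to mirror the steps; assume the stub helpers from the earlier sketches, with navigate, click_and_fill, fetch_one_time_code_from_gmail, and retrieve_text_elements likewise invented:

```python
# Hypothetical end-to-end pseudocode of the Notion login demo. In Wrk these
# are individual Wrk actions in a Wrkflow; all names here are invented.

session_id = open_browser_session()
navigate(session_id, url="https://www.notion.so")

# Element text + match index: pick the right "Log in" occurrence.
click_element(session_id, element_text="Log in", match_index=1)

# Typed input for the email field.
click_and_fill(session_id, element_text="Email", value="user@example.com")

# Reference image for the continue button.
click_element(session_id, reference_image="continue_button.png")

# System integration: Gmail actions (search email -> read content).
code = fetch_one_time_code_from_gmail()

# Element description (AI) for the code input.
click_and_fill(session_id, element_description="the one-time code input", value=code)
click_element(session_id, element_text="Continue")

# Validate the post-login state without interacting with it.
print(retrieve_text_elements(session_id, text="Getting started"))
```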

Key takeaway:

  • Vision RPA handles the UI interactions; other Wrk actions (like Gmail) handle the system integration pieces.
