Video tutorial of using Vision RPA to log into Notion.
Vision RPA Training
Mental model
Vision RPA Wrk actions let you interact with a live automation session (Web Process Automation or Desktop Automation) by finding UI targets visually using one of three strategies:
Element text: OCR finds exact text on screen and targets it
Reference image: image matching finds a screenshot snippet you provide
Element description (AI): model interprets your natural-language description of the target
Everything in this suite depends on having a valid session ID and (for debugging) a live view link.
Wrkflow pattern: build faster using “Test” + reusable session IDs
The fastest build loop shown in the video is:
Test “Open a Browser Session”
Copy the session ID (valid for ~10 minutes in the demo)
Repeatedly Test Vision RPA Wrk actions against that same session ID
Why this matters:
You avoid rerunning an entire Wrkflow just to validate one click/field match.
You can iterate on match strategies (text vs image vs description) quickly.
Practical tip:
Keep the session ID handy (clipboard/notes) because you’ll reuse it constantly while tuning selectors and thresholds.
The 5 Vision RPA Wrk actions and when to use each
1) Click on an element using Vision RPA
Use when you need a reliable click on something visible.
Targeting options:
Element text (best when stable text exists)
Reference image (best when text is unreliable or not present)
Element description (AI) (fallback, see stability notes below)
Note: Match index
If the same text appears multiple times, Vision RPA orders matches in reading order (top-left to bottom-right), and you choose which occurrence to click.
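For intuition, here is a minimal Python sketch of how reading-order indexing can work. This illustrates the ordering concept only, not Wrk's implementation; the match coordinates and row-height value are made up.

```python
# Sketch only: illustrates reading-order indexing, not Wrk's internal logic.
# Each match is (left, top, width, height) in screenshot pixels (assumed format).
matches = [(620, 310, 80, 24), (120, 40, 80, 24), (120, 310, 80, 24)]

def reading_order(box, row_height=30):
    left, top, _, _ = box
    # Bucket boxes into rough rows, then order left-to-right within each row.
    return (top // row_height, left)

ordered = sorted(matches, key=reading_order)
match_index = 2                       # pick the second occurrence (1-based here)
target = ordered[match_index - 1]
print(f"match {match_index} is at {target[:2]}")
```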
2) Click and fill in an element using Vision RPA
Use when you want a single action to:
Identify the field
Click it
Type a value
The video demonstrates this with a one-time code input.
Targeting options are the same as Click:
Element text
Reference image
Element description (AI)
Tip:
This is best when the field focus behaviour is flaky - you reduce the number of steps and avoid “typed into the wrong place” failures.
3) Check if an element is visible using Vision RPA
Use for branching and resilience:
popups that sometimes appear
A/B test variants
conditional screens post-login
Output is effectively boolean (found / not found), so you can:
gate your next step (“if visible, click X else click Y”)
implement simple wait/retry logic without forcing a click
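A minimal Python sketch of the gating pattern, where element_is_visible and click are hypothetical stand-ins for the corresponding Wrk actions:

```python
# Sketch only: element_is_visible and click are hypothetical stand-ins
# for the "Check if an element is visible" and "Click" Wrk actions.
def element_is_visible(text: str) -> bool:
    return False  # placeholder for the action's found / not-found output

def click(text: str) -> None:
    print(f"click: {text}")  # placeholder for the Click action

# Gate the next step on the boolean result instead of forcing a click.
if element_is_visible("Accept cookies"):
    click("Accept cookies")   # popup present: dismiss it before continuing
click("Log in")               # then proceed with the normal flow
```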
4) Retrieve text elements using Vision RPA
Use to read the screen at scale without interacting with it.
What it does:
Takes a screenshot of the session
Runs OCR to extract all matching occurrences of a given word/phrase
Outputs include:
match indexes (aligned with the match indexing used by element text targeting)
bounding boxes + center points (useful for debugging and validation)
Good use cases:
verifying you’re on the right page/state
finding multiple data points to drive decisions
diagnosing why match index “1” isn’t the element you thought it was
Note:
Avoid fanning out multi-threaded interactions based on multiple OCR matches; UI automation is sequential by nature and can become nondeterministic.
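To see how the bounding boxes and center points mentioned above fit together, here is a small Python sketch; the box format and values are assumptions for illustration, not Wrk's output schema:

```python
# Sketch: turning OCR bounding boxes into center points for debugging/validation.
# Each occurrence is (left, top, width, height) in screenshot pixels (assumed format).
occurrences = [(120, 40, 64, 20), (120, 310, 64, 20), (620, 310, 64, 20)]

for index, (left, top, width, height) in enumerate(occurrences, start=1):
    center = (left + width // 2, top + height // 2)
    print(f"match {index}: box=({left}, {top}, {width}, {height}) center={center}")
```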
5) Retrieve element from a screenshot using Vision RPA
This is a preflight / offline testing action.
Use when you have a screenshot and want to experiment with any of the following, without spinning up a live session:
element text matching
reference image matching
element description targeting
Why it matters:
Faster iteration on matching strategy
Useful when session creation is expensive, unavailable, or rate-limited
Stability hierarchy: what to prefer and why
For reliable automations, the video establishes a clear preference order:
Element text (most stable when UI text is consistent)
Reference image (stable when UI is visually consistent; can be very strong for buttons/icons)
Element description (AI) (powerful, but “black box” - can vary as UI shifts)
Practical guidance:
Use element description when:
the same element looks different across states
you can’t capture a stable reference image
Otherwise default to element text / reference image for repeatability.
Debugging: how to read test results
The Click (and related) actions surface high-signal debugging info:
Search image: what the system detected on screen (the match region)
Match percentage: confidence of image match (for reference image flows)
Clicked element image: what it actually clicked (validate correctness)
Change detection: what changed after the click (used to confirm success)
Coordinates: reusable if you need absolute positioning later
Key idea:
The action clicks the center of the detected region (important if your reference image includes padding or whitespace).
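A tiny Python sketch of the center calculation (example coordinates and an assumed box format):

```python
# Sketch: the click lands at the center of the matched region, so any padding
# or whitespace captured in a reference image shifts the actual click point.
match_left, match_top = 400, 220      # top-left of the detected region (example values)
match_width, match_height = 160, 60   # includes whatever padding the image captured

click_x = match_left + match_width // 2
click_y = match_top + match_height // 2
print(f"click at ({click_x}, {click_y})")
```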
Advanced controls you’ll likely need in production automations
These are the knobs the video calls out as your “make it reliable” toolkit:
Match tuning (reference image)
If match percentage is lower than expected or no match is found:
adjust the advanced matching settings (covered in a follow-up video)
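For intuition about match percentages and thresholds, here is a generic OpenCV template-matching sketch; it is not Wrk's matcher, and the file names are placeholders:

```python
import cv2

# Generic template-matching sketch (not Wrk's matcher): shows how a confidence
# threshold gates acceptance when the match percentage comes back low.
# "screenshot.png" and "continue_button.png" are assumed example files.
screenshot = cv2.imread("screenshot.png", cv2.IMREAD_GRAYSCALE)
template = cv2.imread("continue_button.png", cv2.IMREAD_GRAYSCALE)

result = cv2.matchTemplate(screenshot, template, cv2.TM_CCOEFF_NORMED)
_, best_score, _, best_top_left = cv2.minMaxLoc(result)

THRESHOLD = 0.80  # loosen or tighten if legitimate matches score lower than expected
if best_score >= THRESHOLD:
    print(f"match at {best_top_left}, confidence {best_score:.0%}")
else:
    print(f"no match: best confidence {best_score:.0%} is below {THRESHOLD:.0%}")
```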
Click offset (X/Y pixels)
Use when the target region is detectable, but you need to click:
slightly right/left of the match
slightly above/below the match
Common example: you match a label but need to click inside the input next to it.
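A small Python sketch of that label-to-input offset (example values only):

```python
# Sketch: applying a click offset so the click lands inside the input field
# next to a matched label instead of on the label itself.
label_center_x, label_center_y = 480, 250   # center of the matched label (example values)
offset_x, offset_y = 180, 0                 # shift right into the adjacent input

click_x = label_center_x + offset_x
click_y = label_center_y + offset_y
print(f"click at ({click_x}, {click_y})")
```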
Wait for match / element to appear
Increase wait time when:
pages load slowly
post-login transitions take seconds or minutes
Goal: avoid premature failure while the UI is still rendering.
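A minimal Python sketch of the wait-and-retry idea, with match_found as a hypothetical stand-in for a single match attempt:

```python
import time

def match_found() -> bool:
    return False  # hypothetical stand-in for a single match attempt

# Sketch: keep retrying until the element appears or the wait budget runs out,
# so a slow page load or post-login transition does not fail the step early.
def wait_for_match(timeout_s: float = 60, poll_s: float = 3) -> bool:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if match_found():
            return True
        time.sleep(poll_s)
    return False

print(wait_for_match(timeout_s=9, poll_s=3))
```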
Search section (reduce false positives + speed up)
Limit scanning to:
right half / left half / specific screen region (depending on what the UI supports)
Use when:
there are multiple similar matches
the page is visually “busy”
you know where the element always appears
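As an illustration of the idea (not Wrk's implementation), here is a Python/Pillow sketch that crops a screenshot to the right half before searching:

```python
from PIL import Image

# Sketch: restrict matching to the right half of the screen by cropping the
# screenshot before searching, which drops look-alike matches on the left.
# "screenshot.png" is an assumed example file.
screenshot = Image.open("screenshot.png")
width, height = screenshot.size
right_half = screenshot.crop((width // 2, 0, width, height))

# Any coordinates found inside right_half need the crop origin added back
# to express them in full-screen coordinates.
crop_origin_x, crop_origin_y = width // 2, 0
```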
Change detection threshold + section
Use when clicks trigger slow or large UI transitions:
set expected change size (none/small/medium/large)
increase “wait for change” time if the UI updates slowly
ignore animated banners/ads by narrowing the change-detection region
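For intuition only, here is a Python sketch of thresholded change detection over a narrowed section; the bucket cut-offs, pixel threshold, and file names are assumptions, not Wrk's values:

```python
import numpy as np
from PIL import Image

# Sketch: classify how much of a chosen section changed after a click by
# diffing before/after screenshots. "before.png"/"after.png" are assumed files.
before = np.asarray(Image.open("before.png").convert("L"), dtype=np.int16)
after = np.asarray(Image.open("after.png").convert("L"), dtype=np.int16)

# Narrow change detection to the top half to ignore animated banners/ads below.
section = (slice(0, before.shape[0] // 2), slice(None))
changed = np.abs(before[section] - after[section]) > 25   # per-pixel threshold
changed_fraction = changed.mean()

# Map the changed fraction onto rough buckets (none / small / medium / large).
if changed_fraction < 0.01:
    size = "none"
elif changed_fraction < 0.05:
    size = "small"
elif changed_fraction < 0.25:
    size = "medium"
else:
    size = "large"
print(f"{changed_fraction:.1%} of the section changed -> {size}")
```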
Example architecture shown in the video (Notion login)
The demo Wrkflow is a good reference pattern for real projects:
Open browser session
Navigate to a site
Click login (element text + match index)
Enter email (typed input)
Click continue (reference image)
Fetch one-time code using Gmail actions (search email -> read content)
Enter code (click + fill using element description)
Continue and confirm logged-in state
Use Retrieve Text Elements to validate screen content post-login
Key takeaway:
Vision RPA handles the UI interactions; other Wrk actions (like Gmail) handle the system integration pieces.
