Retrieve data from HTML text

Written by Wrk Product
Updated today

Retrieve value(s) from provided HTML text.

Application

  • HTML

Inputs (what you have)

| Name | Description | Data Type | Required? | Example |
| --- | --- | --- | --- | --- |
| HTML text | HTML to extract data from | Text (Long) | Yes | |
| Data extraction rules | Instructions for what data should be extracted from the HTML. See the documentation for configuration details. Example: {"title": "h1"} | Text (Long) | Yes | |
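Because the rules are passed as long text, a malformed rule is easy to miss. The sketch below is a local sanity check you could run before pasting the rules in. It is not Wrk's own validation: the function name `check_rules` is hypothetical, and the shape checks only assume the two rule forms shown under Configuration Rules (a selector string, or an object with a "selector" key).

```python
import json


def check_rules(rules_text: str) -> dict:
    """Parse rules text and run basic shape checks; raise ValueError on problems."""
    # json.loads raises json.JSONDecodeError (a subclass of ValueError) on bad JSON.
    rules = json.loads(rules_text)
    if not isinstance(rules, dict):
        raise ValueError("rules must be a JSON object mapping output names to rules")
    for name, rule in rules.items():
        # Assumption: each rule is a selector string or an object with "selector".
        if isinstance(rule, str):
            continue
        if not (isinstance(rule, dict) and "selector" in rule):
            raise ValueError(f"rule {name!r} needs a selector string or a 'selector' key")
    return rules


print(check_rules('{"title": "h1"}'))
```

A passing check only means the text is well-formed JSON in one of those two forms; it does not confirm that the selectors match anything in your HTML.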

Outputs (what you get)

| Name | Description | Data Type | Required? | Example |
| --- | --- | --- | --- | --- |
| Extracted content | JSON result of the data extracted from the HTML text | Text (Long) | No | |
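Since the extracted content arrives as JSON in a text field, a later step usually needs to parse it before using individual values. The sketch below shows one way to do that in Python. The sample string is made up, and it assumes the output keys mirror the names used in the rules (here, the rules {"title": "h1"}).

```python
import json

# Made-up "Extracted content" value, as it might look for the rules {"title": "h1"}.
extracted_content = '{"title": "Quarterly report"}'

result = json.loads(extracted_content)

# .get() returns None instead of raising KeyError if a name is missing.
title = result.get("title")
print(title)
```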

Outcomes

| Name | Description |
| --- | --- |
| Success | This status is selected if the job has successfully completed. |
| No result | This status is selected if the job has successfully completed but no result was produced. |

Configuration Rules

To help set up data extraction rules for the "Retrieve data from a website" Wrk Action, we recommend the "jQuery Unique Selector" Chrome extension. You can install it from this link: jQuery Unique Selector Chrome Extension.

Here's how to use it:
1. After installing the extension, click the magnifying glass icon in your browser. This activates selector mode.

2. Select the specific item you want to extract from the website by clicking on it.

3. Copy the "Selected Element Selector" that the extension provides, and use it in your data extraction rules for the "Retrieve data from a website" action. Repeat this process for all the content you need to retrieve from the website.

4. To disable the extension, simply click the magnifying glass icon again.

This tool should cover data extraction for most sites. However, some situations may require additional configuration. If you run into one, don't hesitate to reach out for further assistance.
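Once you have copied a selector for each item, the selectors need to be combined into a single rules object. The sketch below builds that text with Python's `json` module so that quoting and commas come out right. The output names and selector strings are hypothetical stand-ins for whatever the extension gives you.

```python
import json

# Hypothetical selectors, standing in for values copied from the extension.
copied_selectors = {
    "product_name": "#main > h1",
    "price": "#main > div.price > span",
}

# Each output name maps to its selector, matching the {"title": "h1"} form.
rules_text = json.dumps(copied_selectors, indent=2)
print(rules_text)
```

The printed text can then be pasted into the "Data extraction rules" input.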

Description | Code Example

Simple rule to extract the text content of an h1 element using a CSS selector.

{"main_heading": "h1"}

Simple rule to extract the text content of the element with the ID "subtitle".

{"sub_heading": "#subtitle"}

Rule to extract an HTML attribute, such as the href from a link.

{"link": "a@href"}

Complex rule with a selector, specifying the desired output format.

{
  "main_heading": {
    "selector": "h1",
    "output": "text"
  }
}

Rule to extract the HTML content of an element.

{
  "title_html": {
    "selector": "h1",
    "output": "html"
  }
}

Rule to extract an attribute by using the "@" prefix in the output field.

{
  "title_id": {
    "selector": "h1",
    "output": "@id"
  }
}

Rule to extract data from a table and format it as a JSON object.

{
  "table_json": {
    "selector": "table",
    "output": "table_json"
  }
}

Rule to extract data from a table and format it as an array.

{
  "table_array": {
    "selector": "table",
    "output": "table_array"
  }
}

Simple syntax for extracting text content and attributes without specifying output or type.

{
  "main_heading": "h1",
  "link": "a@href"
}

Rule to extract a specific table, selected by its ID, and format it as a JSON object.

{
  "table_json": {
    "selector": "#table_id",
    "output": "table_json"
  }
}

Rule to extract a specific table, selected by its ID, and format it as an array.

{
  "table_array": {
    "selector": "#table_id",
    "output": "table_array"
  }
}

Rule to extract the first matching element for a selector.

{
  "first_post_title": {
    "selector": ".post-title",
    "type": "item"
  }
}

Rule to extract all matching elements for a selector.

{
  "all_post_titles": {
    "selector": ".post-title",
    "type": "list"
  }
}

Rule that cleans the extracted text content (the default behavior).

{
  "first_post_description": {
    "selector": ".card > div",
    "clean": true
  }
}

Rule to extract content without cleaning, preserving whitespace and special characters.

{
  "first_post_description": {
    "selector": ".card > div",
    "clean": false
  }
}

Nested extraction rule to gather detailed information from multiple items.

{
  "articles": {
    "selector": ".card",
    "type": "list",
    "output": {
      "title": ".post-title",
      "link": {
        "selector": ".post-title",
        "output": "@href"
      },
      "description": ".post-description"
    }
  }
}

Rule to extract all links from a page, returning an array of href attributes.

{
  "all_links": {
    "selector": "a",
    "type": "list",
    "output": "@href"
  }
}

Rule to extract text and href for each link, providing a more detailed structure.

{
  "all_links": {
    "selector": "a",
    "type": "list",
    "output": {
      "anchor": "a",
      "href": {
        "selector": "a",
        "output": "@href"
      }
    }
  }
}

Rule to extract all textual content from a page body.

{"text": "body"}

Rule to extract all email addresses from a page using mailto links.

{
  "email_addresses": {
    "selector": "a[href^='mailto']",
    "output": "@href",
    "type": "list"
  }
}
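The mailto rule returns href attribute values, so each entry still carries the "mailto:" prefix and possibly a "?subject=..." suffix. The sketch below shows one way to reduce such values to bare addresses after the action has run. The function name `emails_from_mailto` and the sample hrefs are hypothetical, and it assumes the result holds a list of href strings under "email_addresses".

```python
from urllib.parse import unquote


def emails_from_mailto(hrefs):
    """Return de-duplicated addresses from mailto: hrefs, in first-seen order."""
    prefix = "mailto:"
    seen = []
    for href in hrefs:
        href = href.strip()
        if not href.lower().startswith(prefix):
            continue  # skip anything that is not a mailto link
        # Drop the prefix and any "?subject=..." style query part.
        address_part = href[len(prefix):].split("?", 1)[0]
        # A single mailto link may list several comma-separated addresses.
        for address in unquote(address_part).split(","):
            address = address.strip()
            if address and address not in seen:
                seen.append(address)
    return seen


# Made-up values, shaped like the list of hrefs the rule is assumed to return.
hrefs = [
    "mailto:sales@example.com",
    "mailto:help@example.com?subject=Hi",
    "mailto:sales@example.com",
]
print(emails_from_mailto(hrefs))  # ['sales@example.com', 'help@example.com']
```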
