Automated Information Extraction from the Semantic Web

Vasanth Gopal
Published in Indix · Jul 13, 2017

At Indix, we deal with product information, most of which we obtain by crawling and parsing retailer and brand websites. Parsing extracts the content on a page that is directly or indirectly associated with a product. Parsing websites at scale is a complex problem, because each website has its own structure. If it is so complex, how do search engines identify and list product pages regardless of the page pattern? SEO techniques come in handy here: many websites adopt their own SEO practices to rank higher in search results.

What is Schema.org?

In the blog post 9 Steps to Perfect Product Page SEO, we recommended SEO techniques for a perfect product page, and one of them is “Product Schema Markup”. Product Schema Markup tells search engines that your page is about a product, what the product is, and what its details are. Schema.org is a collaborative, community activity to come up with such schemas. It does not cater only to product pages; it has a broader vision of treating the internet as structured data. Because of that, and the benefits such structured data enables, a majority of websites follow Schema.org either partially or completely for their web pages. At Indix, we use the Product and Offer schemas and their sub-schemas to auto-parse retailer and brand websites and produce structured product information.

What are the different Schema.org compliant markups?

There are three types of markup:

  1. Microdata: Microdata is a WHATWG HTML specification used to nest metadata within existing content on web pages.
  2. RDFa: RDFa (Resource Description Framework in Attributes) is a W3C Recommendation that adds a set of attribute-level extensions to HTML, XHTML and various XML-based document types for embedding rich metadata within web documents.
  3. JSON-LD: JSON-LD (JavaScript Object Notation for Linked Data) is a method of encoding Linked Data using JSON. This allows data to be serialized in a way that is similar to traditional JSON.

There is a fourth type, meta tags, which is not specifically defined by Schema.org but is defined in the W3C specification for SEO purposes, with the aim of making the internet a structured repository of information.
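To make the difference between these markups concrete, here is a minimal sketch in plain JavaScript. The sample page describes one made-up product ("Example Kettle") in JSON-LD, Microdata and RDFa, and the hypothetical `extractJsonLd` helper pulls out only the JSON-LD form, since that one is ordinary JSON inside a script tag. It is an illustration under those assumptions, not how any particular library does it.

```javascript
// Hypothetical product page; the product and its values are made up for illustration.
const sampleHtml = `
<html>
  <head>
    <!-- JSON-LD: a JSON document inside a script tag -->
    <script type="application/ld+json">
    {
      "@context": "http://schema.org",
      "@type": "Product",
      "name": "Example Kettle",
      "offers": { "@type": "Offer", "price": "24.99", "priceCurrency": "USD" }
    }
    </script>
  </head>
  <body>
    <!-- Microdata: itemscope / itemtype / itemprop attributes -->
    <div itemscope itemtype="http://schema.org/Product">
      <span itemprop="name">Example Kettle</span>
    </div>
    <!-- RDFa: vocab / typeof / property attributes -->
    <div vocab="http://schema.org/" typeof="Product">
      <span property="name">Example Kettle</span>
    </div>
  </body>
</html>`;

// Collect every JSON-LD block in the page and parse it as ordinary JSON.
// A regular expression is enough for a sketch; a real parser should walk the DOM.
function extractJsonLd(html) {
  const pattern = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const items = [];
  for (const match of html.matchAll(pattern)) {
    try {
      items.push(JSON.parse(match[1]));
    } catch (err) {
      // Skip blocks that are not valid JSON instead of failing the whole page.
    }
  }
  return items;
}

const products = extractJsonLd(sampleHtml).filter((item) => item['@type'] === 'Product');
console.log(products[0].name, products[0].offers.price); // prints: Example Kettle 24.99
```

Microdata and RDFa carry the same facts, but they are spread across attributes of the visible HTML, so extracting them requires walking the element tree rather than parsing a single JSON block.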

How does Indix auto-parse Schema.org compliant websites?

Introducing our open-source JavaScript library, web-auto-extractor, which helps extract Schema.org compliant data from any web page. The library accepts an HTML page as input and returns structured information as per the schemas defined at http://schema.org/docs/schemas.html. Below is an example of how to use the library:

Input

Code snippet: https://github.com/indix/web-auto-extractor#input

Usage

Code snippet: https://github.com/indix/web-auto-extractor#usage

Output

Code snippet: https://github.com/indix/web-auto-extractor#output
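For a feel of what this kind of extraction involves, here is a dependency-free sketch that reads meta tags from a page. It is not the web-auto-extractor implementation or its API; the sample page and the `readAttribute` and `extractMetaTags` helpers are hypothetical, and the regular expressions only cope with simple, well-formed tags.

```javascript
// Hypothetical page head; the tag values are made up for illustration.
const pageHtml = `
<head>
  <title>Example Kettle</title>
  <meta name="description" content="A 1.7 litre electric kettle.">
  <meta property="og:title" content="Example Kettle">
  <meta property="og:image" content="https://example.com/kettle.jpg">
</head>`;

// Read one quoted attribute value out of a single tag's source text.
function readAttribute(tag, attribute) {
  const pattern = new RegExp(`\\s${attribute}\\s*=\\s*(?:"([^"]*)"|'([^']*)')`, 'i');
  const match = tag.match(pattern);
  if (!match) return null;
  return match[1] !== undefined ? match[1] : match[2];
}

// Map each <meta> tag's property (or name) key to its content value.
function extractMetaTags(html) {
  const tags = {};
  for (const [tag] of html.matchAll(/<meta\b[^>]*>/gi)) {
    const key = readAttribute(tag, 'property') || readAttribute(tag, 'name');
    const content = readAttribute(tag, 'content');
    if (key && content !== null) tags[key] = content;
  }
  return tags;
}

const metaTags = extractMetaTags(pageHtml);
console.log(metaTags['og:title']); // prints: Example Kettle
```

A production extractor would use a real HTML parser instead of regular expressions, which is one reason to reach for a library rather than hand-rolling this for every site.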

Auto-parsing at Indix’s scale using web-auto-extractor

We picked 1,840 retailer and brand websites to understand the extent to which their product pages are Schema.org compliant. We then picked the top 12 fields within the Product and Offer schemas and measured the coverage of each field across the 1,840 websites. The table below presents metrics on the collected information:

Below is the summary of the analysis:

As can be seen, not all fields have 100% coverage. Meta tags are the highest contributor for Name, Image and Description, whereas RDFa is the poorest contributor for the majority of the fields. Microdata adoption is on the higher side for the top four fields (Name, Description, Image and Price). The surprising aspect is JSON-LD, which seems to have broader adoption across all the fields. As more websites adopt JSON, they are enabling themselves for JSON-LD.
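If you want to run a similar analysis on your own set of sites, per-field coverage can be computed along the following lines. This is a sketch only: the site names, the input shape and the `fieldCoverage` helper are assumptions made for illustration, and the toy numbers are not Indix's measurements.

```javascript
// Toy input: for each site, the fields found under each markup type.
// Site names and values are invented for illustration only.
const extractedBySite = {
  'shop-a.example': {
    microdata: ['name', 'price'],
    rdfa: [],
    jsonld: ['name', 'price', 'image'],
    metatags: ['name', 'image', 'description'],
  },
  'shop-b.example': {
    microdata: [],
    rdfa: ['name'],
    jsonld: [],
    metatags: ['name', 'description'],
  },
  'shop-c.example': {
    microdata: ['name', 'description', 'image', 'price'],
    rdfa: [],
    jsonld: ['name'],
    metatags: ['name', 'image'],
  },
  'shop-d.example': { microdata: [], rdfa: [], jsonld: [], metatags: [] },
};

// Percentage of sites on which each markup type supplies each field.
function fieldCoverage(sites, fields) {
  const markupTypes = ['microdata', 'rdfa', 'jsonld', 'metatags'];
  const siteList = Object.values(sites);
  const coverage = {};
  for (const field of fields) {
    coverage[field] = {};
    for (const type of markupTypes) {
      const hits = siteList.filter((site) => (site[type] || []).includes(field)).length;
      coverage[field][type] = siteList.length === 0 ? 0 : (100 * hits) / siteList.length;
    }
  }
  return coverage;
}

const coverage = fieldCoverage(extractedBySite, ['name', 'image', 'description', 'price']);
console.log(coverage.name); // { microdata: 50, rdfa: 25, jsonld: 50, metatags: 75 }
```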

To achieve the broader vision of making the internet's data structured, many more websites need to open themselves up to being structured and Schema.org compliant. Indix is part of this vision to organize and structure product information, and one of our contributions to it is web-auto-extractor. Try out the library and share your feedback, or any issues you face, in the GitHub repository: https://github.com/indix/web-auto-extractor.

Originally published at Indix.
