Analytics Web Data Commons with SPARQL

python
rdf
commoncrawl
data
Published

June 24, 2020

I am trying to understand how the JobPosting schema is used in Web Data Commons structured data extracts from Common Crawl. I wrote a lot of ad hoc Python to get usage statistics on JobPosting. However SPARQL is a tool that makes it much easier to answer these kinds of questions.

After reading in the graphs individually they can be combined into a rdflib.Dataset so we can query them all together. For this analysis I got 13,000 pages from the 2019 Web Data Commons Extract, each from a distinct web domain and containing exactly one job posting.

dataset = rdflib.Dataset()
for graph in graphs:
    dataset.add_graph(graph)

We can then execute SPARQL queries in RDFLib using dataset.query. The SPARQL 1.1 Specification is actually pretty easy to read and with a little practice it’s a simple language to learn. Here’s a query that gives summary statistics of the most common RDF types in the sample of graphs.

SELECT ?type (COUNT(?src) AS ?postings) (SUM(?n) as ?total) {
SELECT ?src ?type (COUNT(?type) AS ?n)
WHERE {
    GRAPH ?src
    {[] a ?type .}
}
GROUP BY ?src ?type
}
GROUP BY ?type
HAVING (COUNT(?src) > 50)
ORDER BY desc(?total)

Using Pandas we can make it more meaningful by calculating the proportion of pages with each RDF type, and the average number of times a type occurs in a page graph.

df = pd.DataFrame([[value.toPython() for value in row] for row in results],
                  columns = ['uri', 'n', 'total'])
df.assign(frac=lambda df: df.n/max(df.n),
          avg = lambda df: df.total / df.n)
URI n total frac avg
http://schema.org/JobPosting 13092 13092 1.000000 1.000000
http://schema.org/Place 8734 9301 0.667125 1.064919
http://schema.org/Organization 7972 9184 0.608921 1.152032
http://schema.org/PostalAddress 8065 9018 0.616025 1.118165
http://schema.org/MonetaryAmount 2958 2970 0.225940 1.004057
http://schema.org/PropertyValue 2875 2882 0.219600 1.002435
http://schema.org/ListItem 946 2871 0.072258 3.034884
http://schema.org/QuantitativeValue 2602 2619 0.198747 1.006533
http://schema.org/ImageObject 609 1057 0.046517 1.735632
http://schema.org/BreadcrumbList 939 988 0.071723 1.052183

The most common object is a JobPosting and we can construct a query to get the most frequently used properties.

PREFIX sdo: <http://schema.org/>
SELECT ?rel (COUNT(?src) AS ?postings) (SUM(?n) as ?total) {
SELECT ?rel ?src (COUNT(?src) AS ?n)
WHERE {
    GRAPH ?src
    {[] a sdo:JobPosting; ?pred ?data }
    BIND (replace(str(?pred), 'https?://schema.org/(JobPosting/)?', '') AS ?rel)
}
GROUP BY ?rel ?src
}
GROUP BY ?rel
ORDER BY desc(?postings)
Property Postings Total Fraction of Posts Average times per post
http://www.w3.org/1999/02/22-rdf-syntax-ns#type 13092 13092 1.000000 1.000000
title 11862 12052 0.906049 1.016018
description 11323 11540 0.864879 1.019165
datePosted 10420 10515 0.795906 1.009117
jobLocation 9800 10423 0.748549 1.063571
hiringOrganization 9568 9720 0.730828 1.015886
employmentType 7702 8139 0.588298 1.056739
validThrough 4688 4691 0.358081 1.000640
baseSalary 3657 3713 0.279331 1.015313
industry 3328 4081 0.254201 1.226262
identifier 3214 3217 0.245493 1.000933
url 2744 2894 0.209594 1.054665
workHours 1352 1366 0.103269 1.010355
experienceRequirements 1235 1262 0.094332 1.021862
occupationalCategory 1152 1509 0.087993 1.309896
educationRequirements 959 991 0.073251 1.033368
salaryCurrency 904 910 0.069050 1.006637
qualifications 839 902 0.064085 1.075089
responsibilities 834 894 0.063703 1.071942
image 790 859 0.060342 1.087342
skills 726 795 0.055454 1.095041

Digging further we could extract the types of the baseSalary; while it’s mostly a MonetaryAmount (a complex object) it’s also often a string in some language (a literal).

PREFIX sdo: <http://schema.org/>
PREFIX sdo_jp: <http://schema.org/JobPosting/>

SELECT ?type (COUNT(?src) as ?n)
WHERE {
    GRAPH ?src
    {[] a sdo:JobPosting; sdo:baseSalary|sdo_jp:baseSalary ?data .
     OPTIONAL {?data a ?datatype .}
     BIND (coalesce(datatype(?data), ?datatype) as ?type)}
}
GROUP BY ?type
ORDER BY DESC(?n)
LIMIT 20
Type Count
http://schema.org/MonetaryAmount 2946
http://www.w3.org/1999/02/22-rdf-syntax-ns#langString 527
http://www.w3.org/2001/XMLSchema#string 135
https://schema.org/MonetaryAmount 72
None 16
http://schema.org/PriceSpecification 7
http:/schema.orgMonetaryAmount 6

As with JobPosting we can then dig into the most commonly used properties of a MonetaryAmount.

PREFIX sdo: <http://schema.org/>
PREFIX sdo_jp: <http://schema.org/JobPosting/>

SELECT ?rel (COUNT(?src) AS ?postings) (SUM(?n) as ?total) {
SELECT ?rel ?src (COUNT(?src) AS ?n)
WHERE {
    GRAPH ?src
    {[] a sdo:JobPosting; sdo:baseSalary|sdo_jp:baseSalary ?salary .
     ?salary ?pred ?data .}
    BIND (replace(str(?pred), 'https?://schema.org/(MonetaryAmount/)?', '') AS ?rel)
}
GROUP BY ?rel ?src
}
GROUP BY ?rel
ORDER BY desc(?postings)
Type Fraction of all jobs Fraction of results Average Frequency
http://www.w3.org/1999/02/22-rdf-syntax-ns#type 0.230734 1.000000 1.003972
value 0.215841 0.935452 1.005308
currency 0.206675 0.895730 1.004065
minValue 0.010922 0.047335 1.000000
maxValue 0.010769 0.046673 1.000000
unitText 0.002979 0.012910 1.000000

And we can continue into looking at the datatypes of a MonetaryAmount value, the RDF type when it’s blank or the Literal type.

PREFIX sdo: <http://schema.org/>
PREFIX sdo_jp: <http://schema.org/JobPosting/>
PREFIX sdo_ma: <http://schema.org/MonetaryAmount/>


SELECT ?type (COUNT(?src) as ?n)
WHERE {
    GRAPH ?src
    {[] a sdo:JobPosting; (sdo:baseSalary|sdo_jp:baseSalary)/(sdo:value|sdo_ma:value) ?data .
     OPTIONAL {?data a ?datatype .}
     BIND (coalesce(datatype(?data), ?datatype) as ?type)}
}
GROUP BY ?type
ORDER BY DESC(?n)
LIMIT 20
Type n
http://schema.org/QuantitativeValue 2323
http://www.w3.org/1999/02/22-rdf-syntax-ns#langString 280
http://www.w3.org/2001/XMLSchema#string 152
None 7
http://schema.org/PropertyValue 4
https://schema.org/QuantitativeValue 2

These kinds of techniques could be templated and extended to build a full frequency table. These are generally fairly consistent with what I found before with a different method and a smaller sample so this increases my confidence in those results. You can see the full Jupyter notebook for details.