Created April 6, 2023 14:30
A small bash script that downloads 1.6 TB of extracted structured data from the Common Crawl and finds the pages on which HowTo/FAQ structured data is available.
#!/bin/bash
# Requires GNU parallel and pv: `apt-get install parallel pv`

# Download the list of dump-file URLs
curl http://webdatacommons.org/structureddata/2022-12/files/file.list > urls.txt

# Create the output file
touch output.txt

# Stream each dump file, decompress it on the fly, and keep only statements
# that mention FAQPage or HowTo. Four downloads run in parallel; pv shows
# progress on both the input and output ends of the pipeline.
cat urls.txt | pv -cN Input | parallel -j 4 "curl -s {} | zcat | grep -e '<http://schema.org/FAQPage>' -e '<http://schema.org/HowTo>'" | pv -cN Output > output.txt
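The WebDataCommons dumps are N-Quads, so each matched line should end with the URL of the page the statement was extracted from, followed by a closing dot. Under that assumption, the matches in output.txt can be reduced to a de-duplicated list of pages; this is a hypothetical post-processing sketch, not part of the original gist (the `pages.txt` name and the field position are assumptions):

```shell
#!/bin/bash
# Sketch: extract the page URL (the field just before the trailing ".")
# from each matched N-Quad in output.txt and de-duplicate the result.
[ -f output.txt ] || touch output.txt  # guard so the sketch runs standalone
awk '{print $(NF-1)}' output.txt | sort -u > pages.txt
```

Using awk on whitespace-split fields sidesteps a full N-Quads parser; it is good enough here because the page URL is always the last element before the terminating dot.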