Created
November 8, 2022 13:04
Common regexes used to extract data from HTML
Get Hexadecimal color code | |
#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\b | |
Validate email address | |
/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/igm | |
IPv4 address | |
/\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b/ | |
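A minimal Python sketch of how the IPv4 pattern above might be applied. The pattern is copied from the entry; the function name `find_ipv4` and the sample strings are made up for illustration.

```python
import re

# Pattern taken from the IPv4 entry above (JavaScript-style delimiters dropped).
IPV4 = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
    r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b"
)


def find_ipv4(text: str) -> list[str]:
    """Return every dotted-quad address found in text (hypothetical helper)."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return IPV4.findall(text)


print(find_ipv4("gateway 192.168.0.1, dns 8.8.8.8"))
```

Each octet is limited to 0-255 by the alternation, so a string such as `999.1.1.1` yields no match.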
IPv6 address | |
(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])) | |
Thousands separator | |
/\d{1,3}(?=(\d{3})+(?!\d))/g | |
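The thousands-separator pattern only finds the digit groups that need a separator after them; the separator itself is added in the replacement step. A short Python sketch, assuming the input is a plain string of digits (the helper name `add_thousands_separator` is hypothetical):

```python
import re

# Pattern taken from the entry above: 1-3 digits followed by one or more
# complete groups of three digits and then a non-digit (or end of string).
THOUSANDS = re.compile(r"\d{1,3}(?=(\d{3})+(?!\d))")


def add_thousands_separator(digits: str, sep: str = ",") -> str:
    """Insert sep after every matched digit group (hypothetical helper)."""
    return THOUSANDS.sub(lambda m: m.group(0) + sep, digits)


print(add_thousands_separator("1234567"))
```

Applied to a number with a fractional part, the pattern would also group the digits after the decimal point, so it is safest to pass only the integer part.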
Get domain from url | |
/https?:\/\/(?:[-\w]+\.)?([-\w]+)\.\w+(?:\.\w+)?\/?.*/i | |
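A Python sketch of the domain pattern above in use. The capture group returns the label just before the top-level domain; the helper name `domain_label` and the example URLs are assumptions for illustration.

```python
import re

# Pattern taken from the entry above; the /i flag becomes re.IGNORECASE.
DOMAIN = re.compile(
    r"https?:\/\/(?:[-\w]+\.)?([-\w]+)\.\w+(?:\.\w+)?\/?.*",
    re.IGNORECASE,
)


def domain_label(url: str):
    """Return the captured domain label, or None if the URL does not match."""
    match = DOMAIN.match(url)
    return match.group(1) if match else None


print(domain_label("https://www.example.com/path?x=1"))
```

The pattern allows for one optional subdomain only, so hosts with several subdomain levels may return a subdomain label instead of the registered domain.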
Sort keywords by word count | |
^[^\s]*$ matches exactly 1-word keyword | |
^[^\s]*\s[^\s]*$ matches exactly 2-word keyword | |
^[^\s]*\s[^\s]* matches keywords of at least 2 words (2 and more) | |
^([^\s]*\s){2}[^\s]*$ matches exactly 3-word keyword | |
^([^\s]*\s){4}[^\s]* matches keywords of at least 5 words (long-tail) | |
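One way the exact-count patterns above could be combined to bucket a keyword list, sketched in Python. The function name `word_bucket` and the bucket labels are made up; only the 1-, 2- and 3-word patterns are used.

```python
import re

# Exact-count patterns taken from the entries above.
ONE_WORD = re.compile(r"^[^\s]*$")
TWO_WORDS = re.compile(r"^[^\s]*\s[^\s]*$")
THREE_WORDS = re.compile(r"^([^\s]*\s){2}[^\s]*$")


def word_bucket(keyword: str) -> str:
    """Classify a keyword by word count (hypothetical helper)."""
    if ONE_WORD.match(keyword):
        return "1-word"
    if TWO_WORDS.match(keyword):
        return "2-word"
    if THREE_WORDS.match(keyword):
        return "3-word"
    return "longer"


keywords = ["shoes", "red shoes", "cheap red shoes", "buy cheap red shoes online"]
print({k: word_bucket(k) for k in keywords})
```

The patterns count single whitespace characters, so keywords with doubled spaces would land in the wrong bucket unless they are normalized first.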
Valid phone number | |
^\+?\d{1,3}?[- .]?\(?(?:\d{2,3})\)?[- .]?\d\d\d[- .]?\d\d\d\d$ | |
Leading and trailing whitespace | |
^\s+|\s+$ | |
Get img src | |
<\s*img[^>]*\bsrc\s*=\s*["']?([^"' >]*) | |
Validate date in DD/MM/YYYY format | |
^(?:(?:31(\/|-|\.)(?:0?[13578]|1[02]))\1|(?:(?:29|30)(\/|-|\.)(?:0?[13-9]|1[0-2])\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:29(\/|-|\.)0?2\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\d|2[0-8])(\/|-|\.)(?:(?:0?[1-9])|(?:1[0-2]))\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$ | |
Valid ISBN | |
/\b(?:ISBN(?:: ?| ))?((?:97[89])?\d{9}[\dx])\b/i | |
Check zip code | |
^\d{5}(?:[-\s]\d{4})?$ | |
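A small Python sketch of the zip code check above: five digits, optionally followed by a hyphen or whitespace character and four more digits. The helper name `is_zip_code` and the sample values are assumptions.

```python
import re

# Pattern taken from the entry above.
ZIP = re.compile(r"^\d{5}(?:[-\s]\d{4})?$")


def is_zip_code(value: str) -> bool:
    """Return True if value has the 5-digit or 5+4-digit shape."""
    return ZIP.match(value) is not None


for candidate in ["12345", "12345-6789", "1234", "12345-678"]:
    print(candidate, is_zip_code(candidate))
```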
Valid Twitter username | |
/@([A-Za-z0-9_]{1,15})/ | |
Credit card numbers | |
^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|6(?:011|5[0-9][0-9])[0-9]{12}|3[47][0-9]{13}|3(?:0[0-5]|[68][0-9])[0-9]{11}|(?:2131|1800|35\d{3})\d{11})$ | |
Find css attributes | |
^\s*[a-zA-Z\-]+\s*[:]{1}\s[a-zA-Z0-9\s.#]+[;]{1} | |
Strip html comments | |
<!--(.*?)--> | |
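A Python sketch of stripping comments with the pattern above. As written, `.` does not match newlines, so this sketch adds `re.DOTALL` to handle comments that span several lines; that flag is an assumption on top of the original entry, and `strip_html_comments` is a made-up name.

```python
import re

# Pattern taken from the entry above; re.DOTALL lets .*? cross line breaks.
COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)


def strip_html_comments(html: str) -> str:
    """Remove every <!-- ... --> block from html."""
    return COMMENT.sub("", html)


print(strip_html_comments("<p>a</p><!-- note --><p>b</p>"))
```

The lazy quantifier `.*?` stops at the first `-->`, so two separate comments on one line are removed individually rather than together with the text between them.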
Facebook profile url | |
/(?:http:\/\/)?(?:www\.)?facebook\.com\/(?:(?:\w)*#!\/)?(?:pages\/)?(?:[\w\-]*\/)*([\w\-]*)/ | |
Check IE version | |
^.*MSIE [5-8](?:\.[0-9]+)?(?!.*Trident\/[5-9]\.0).*$ | |
Extract price | |
/(\$[0-9,]+(\.[0-9]{2})?)/ | |
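A Python sketch of the price pattern above. Because the pattern contains two capturing groups, `findall` would return tuples, so this sketch iterates over matches and keeps group 1 only. The helper name `extract_prices` and the sample text are made up.

```python
import re

# Pattern taken from the entry above.
PRICE = re.compile(r"(\$[0-9,]+(\.[0-9]{2})?)")


def extract_prices(text: str) -> list[str]:
    """Return each dollar amount as it appears in text."""
    return [match.group(1) for match in PRICE.finditer(text)]


print(extract_prices("Was $1,299.00, now $999"))
```

Since commas are allowed inside the amount, a price directly followed by a comma and no cents (for example `$999,`) would keep that trailing comma.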
Parse email header | |
/\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}\b/i | |
Match a particular filetype (example: .jpg, .jpeg, .png, .gif) | |
/\.(?:jpe?g|png|gif)$/i | |
Match a url string | |
/[-a-zA-Z0-9@:%_\+.~#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?/gi | |
Append rel="nofollow" to links | |
(<a\s*(?!.*\brel=)[^>]*)(href="https?://)((?!(?:(?:www\.)?'.implode('|(?:www\.)?', $follow_list).'))[^"]+)"((?!.*\brel=)[^>]*)(?:[^>]*)> | |
Media query match | |
/@media([^{]+)\{([\s\S]+?})\s*}/g | |
Match empty paragraph tags | |
<p>(\s|&nbsp;|<\/?\s?br\s?\/?>)*<\/?p> | |
Extract Heading(h1) tag | |
<h1>This is a heading</h1> | |
<h1>([^<]+)</h1> | |
Extract Hyperlink from (A) tag | |
<a href="https://www.datascraping.co">Typical Website Link</a> | |
<a href="([^"]+)">Typical Website Link</a> | |
Extract Hyperlink and Anchor Text from (A) tag | |
<a href="https://www.datascraping.co">Typical Website Link</a> | |
<a href="([^"]+)">([^<]+)</a> | |
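A Python sketch that runs the two-group pattern above against the sample tag from this entry. With two capturing groups, `findall` returns one `(href, anchor text)` tuple per link; the name `extract_links` is hypothetical.

```python
import re

# Pattern taken from the entry above.
LINK = re.compile(r'<a href="([^"]+)">([^<]+)</a>')


def extract_links(html: str) -> list[tuple[str, str]]:
    """Return (href, anchor text) pairs for every matching <a> tag."""
    return LINK.findall(html)


sample = '<a href="https://www.datascraping.co">Typical Website Link</a>'
print(extract_links(sample))
```

The pattern expects `href` to be the only attribute, so tags that carry a `class` or `target` attribute will not match without loosening it.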
Extract Image alt text and source from (IMG) tag | |
<img alt="screen scraping" src="https://cdn.datascraping.co/images/create-a-web-scraping-agent.jpg"> | |
<img alt="([^"]+)" src="([^"]+)"\s*\/?> | |
Extract data attribute and price from (DIV) tag | |
<div data-id="17839" data-availability="InStock">USD 129.00</div> | |
<div data-id="(\d+)" data-availability="(\w+)">USD\s*([^<]+)<\/div> | |
Extract text from (STRONG) tag | |
<strong>My Bold text</strong> | |
<strong>([^<]+)</strong> | |
Extract text from (span) tag with some CSS class | |
<span class="some-css-class">My Favorite Data</span> | |
<span class="some-css-class">([^<]+)</span> | |
Extract META description content value | |
<meta name="description" content="the SEO description of web page in heading section" /> | |
<meta name="description" content="([^"]+)" /> | |
Scrape only the first 3 values in a table row | |
<tr>\s*<td>([^<]+)</td>\s*<td>([^<]+)</td>\s*<td>([^<]+)</td>\s*</tr> | |
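A Python sketch of the table-row pattern above. It matches rows that contain exactly three plain `<td>` cells and returns the three cell values as a tuple. The sample HTML and the name `first_three_cells` are made up for illustration.

```python
import re

# Pattern taken from the entry above.
ROW = re.compile(
    r"<tr>\s*<td>([^<]+)</td>\s*<td>([^<]+)</td>\s*<td>([^<]+)</td>\s*</tr>"
)


def first_three_cells(html: str) -> list[tuple[str, str, str]]:
    """Return one (cell1, cell2, cell3) tuple per matching row."""
    return ROW.findall(html)


sample = "<table><tr>\n<td>Acme</td> <td>5%</td> <td>2022-11-08</td>\n</tr></table>"
print(first_three_cells(sample))
```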
Extract paragraphs within a div | |
<div class=description> | |
<p> paragraph 1 </p> | |
<p> paragraph 2 </p> | |
<p> paragraph 3 </p> | |
</div> | |
<div class=description>([\s\S]*?)</div> | |
Ebates ---------------------------------------------------- | |
Store Names: <a [^>]*>([^<]+) | |
PercentValue: <a [^>]*>\s*([\d.]+) | |
DollarValue: <a [^>]*>\s*\$([\d.]+) | |
UptoDollarValue: <a [^>]*>\s*Up\sto\s\$([\d.]+) | |
UptoPercentValue: <a [^>]*>\s*Up\sto\s([\d.]+)\% | |
NoDiscount: <a [^>]*>\s*(No Discount) | |
CouponsOnly: <a [^>]*>\s*(Coupons Only) | |
InStoreOnly: <a [^>]*>\s*(In-Store) | |
BeFrugal ---------------------------------------------- | |
StoreName: <tr>\s*<td><a [^>]*>([^<]+) | |
PercentValue: <td class="green"[^>]*>\s*([\d.]+) | |
DollarValue: <td class="green"[^>]*>\s*\$([\d.]+) | |
UptoDollarValue: <td class="green"[^>]*>\s*Up\sto\s\$([\d.]+) | |
UptoPercentValue: <td class="green"[^>]*>\s*Up\sto\s([\d.]+)\% | |
Extrabux --------------------------------------------- | |
StoreName: <a [^>]*>([^<]+) | |
PercentValue: <a class="cashBack transferLink"[^>]*>\s*([\d.]+) | |
DollarValue: <a class="cashBack transferLink"[^>]*>\s*\$([\d.]+) | |
UptoDollarValue: <a class="cashBack transferLink"[^>]*>\s*Up\sto\s\$([\d.]+) | |
UptoPercentValue: <a class="cashBack transferLink"[^>]*>\s*Up\sto\s([\d.]+)\% | |
CashbackBin ------------------ | |
Store Name: <h1[^>]*>([^<]+) | |
Vendor: title="([^"]*) | |
Rate: <td class="l lo" style="text-align: center;">[^>]*>([^<]*) | |
Bonus: <td class="l bonus_amount">[\s\S]*?<span class="card_secondary_text">([^<]*) | |
CBM ------------------------- | |
Match up to the first occurrence of % | |
[^%]* | |
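A Python sketch of the negated character class above: `[^%]*` consumes everything up to, but not including, the first `%`. The helper name `text_before_percent` and the sample strings are assumptions.

```python
import re


def text_before_percent(text: str) -> str:
    """Return the part of text before the first % sign."""
    # [^%]* can match the empty string, so re.match always succeeds here.
    return re.match(r"[^%]*", text).group(0)


print(text_before_percent("12.5% cash back"))
```

If the text contains no `%` at all, the whole string is returned.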
Cb ---------------------------------------------------- | |
https://www.couponbox.com/us/stores/a | |
https://www.couponbox.com/us/stores/b | |
https://www.couponbox.com/us/stores/c | |
https://www.couponbox.com/us/stores/d | |
https://www.couponbox.com/us/stores/e | |
https://www.couponbox.com/us/stores/f | |
https://www.couponbox.com/us/stores/g | |
https://www.couponbox.com/us/stores/h | |
https://www.couponbox.com/us/stores/i | |
https://www.couponbox.com/us/stores/j | |
https://www.couponbox.com/us/stores/k | |
https://www.couponbox.com/us/stores/l | |
https://www.couponbox.com/us/stores/m | |
https://www.couponbox.com/us/stores/n | |
https://www.couponbox.com/us/stores/o | |
https://www.couponbox.com/us/stores/p | |
https://www.couponbox.com/us/stores/q | |
https://www.couponbox.com/us/stores/r | |
https://www.couponbox.com/us/stores/s | |
https://www.couponbox.com/us/stores/t | |
https://www.couponbox.com/us/stores/u | |
https://www.couponbox.com/us/stores/v | |
https://www.couponbox.com/us/stores/w | |
https://www.couponbox.com/us/stores/x | |
https://www.couponbox.com/us/stores/y | |
https://www.couponbox.com/us/stores/z | |
href="https:\/\/www\.couponbox\.com\/us\/coupons\/([^"]*) | |
srcset="\/\/([^"]*) | |
href="([^"]*) | |
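A Python sketch tying the URL list and the coupon link pattern above together. It builds the 26 store index URLs from the alphabet rather than listing them, and applies the `href` pattern to a snippet of HTML. No network request is made; the sample markup, the slug `acme` and the helper name `coupon_slugs` are made up.

```python
import re
from string import ascii_lowercase

# One index URL per letter, matching the a-z list above.
STORE_INDEX_URLS = [
    f"https://www.couponbox.com/us/stores/{letter}" for letter in ascii_lowercase
]

# Pattern taken from the entry above; captures the path after /us/coupons/.
COUPON_LINK = re.compile(r'href="https:\/\/www\.couponbox\.com\/us\/coupons\/([^"]*)')


def coupon_slugs(html: str) -> list[str]:
    """Return the captured path segment of every coupon link in html."""
    return COUPON_LINK.findall(html)


sample = '<a href="https://www.couponbox.com/us/coupons/acme">Acme</a>'
print(len(STORE_INDEX_URLS), coupon_slugs(sample))
```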