Web scraping concepts from simple to expert


Intro

Web scraping is the act of extracting data from publicly available online interfaces through reverse engineering and automation. Scraping gives you valuable data that helps your research and can give you an edge over your competitors.

Some ways web scraping can be useful for you:

  • knowing all your competitors' prices over a large set of products
  • having an up-to-date picture of the real estate market
  • analyzing what people say online about your service

It's really up to your imagination what data provides value to you.

Technology

Let's analyse how your target works. Sometimes you find the precious data in the HTML, sometimes you have to dig deeper. In this section I'll cover the reverse engineering part.

Right click > View page source. In case of server-side rendering, this is where you find the data you need. The way of extraction depends on the data.


Regex

In an ideal situation, a regular expression is enough. Regex is difficult to write, but once you have it, it works. This approach has the advantage that no matter how the structure of the HTML changes, you still get your data.
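
A minimal Node.js sketch of this approach, assuming Axios as the HTTP client; the URL and the price markup pattern are hypothetical:

```js
// Extract prices from raw HTML with a regular expression.
// The URL and the <span class="price"> markup are hypothetical examples.
const axios = require('axios');

async function scrapePrices() {
  const { data: html } = await axios.get('https://example.com/products');

  // Matches e.g. <span class="price">12 345 Ft</span> anywhere in the page.
  const priceRegex = /<span class="price">([^<]+)<\/span>/g;

  const prices = [...html.matchAll(priceRegex)].map(match => match[1].trim());
  console.log(prices);
}

scrapePrices().catch(console.error);
```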

DOM parsing

HTML is not strictly a subset of XML, but it has the same tree structure. All modern programming languages have a way of parsing XML, and usually HTML too. The difference between libraries is the DOM element selector type. You can always count on XPath, but you will certainly find implementations with CSS selectors too. The hardest part of scraping is making your bot reliable, so look for IDs and classes the target uses frequently. Identifiers used both in CSS and JS are the best: those are the least likely to be modified, because of the fear of breaking something.

Use Devtools to find these elements. In Firefox: right click the element > press q; in Chrome: right click the element > Inspect. You can see the CSS selectors in use, as well as events attached to DOM elements.
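
A minimal DOM parsing sketch in Node.js with jsdom; the URL and the .product-title selector are hypothetical:

```js
// Parse server-rendered HTML and query it with CSS selectors.
// The URL and the .product-title selector are hypothetical.
const axios = require('axios');
const { JSDOM } = require('jsdom');

async function scrapeTitles() {
  const { data: html } = await axios.get('https://example.com/products');
  const { document } = new JSDOM(html).window;

  // Prefer identifiers the site itself relies on in both CSS and JS.
  const titles = [...document.querySelectorAll('.product-title')]
    .map(element => element.textContent.trim());

  console.log(titles);
}

scrapeTitles().catch(console.error);
```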

JSON

Client-side rendering technologies produce obfuscated HTML, but they usually also expose a clean JSON API. Open up the Network tab in your Devtools, clear its current output, and refresh the page. If your target has a feed, scroll down a couple of times; if it's a searchable list, search with some parameters. As you interact with the page, the Network tab shows you the data exchanged between the client and the server. Filter for XHR, click on the top request, expand the request details view, click Response there, click back to the list item, and move across requests with your arrow keys. This way you can quickly identify the URLs of interest. Look out for the Domain column: sometimes data comes from a subdomain, or from a different domain than the one you are currently visiting.
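
Once you have identified such an endpoint, calling it directly is usually far more reliable than parsing HTML. A minimal sketch; the endpoint URL, the page parameter, and the headers are assumptions:

```js
// Call the JSON endpoint discovered in the Network tab directly.
// The endpoint URL and the `page` parameter are hypothetical.
const axios = require('axios');

async function fetchListings(page) {
  const response = await axios.get('https://api.example.com/listings', {
    params: { page },
    headers: {
      // Some backends only answer requests that look like the browser's XHR calls.
      Accept: 'application/json',
      'X-Requested-With': 'XMLHttpRequest',
    },
  });

  return response.data; // already parsed JSON
}

fetchListings(1).then(data => console.log(data)).catch(console.error);
```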


Headless browsers & UI testing frameworks

Sometimes the best web scraping technologies are marketed as UI testing automation frameworks, so look out for those keywords when you choose your tech. Headless browsers are browsers without visual output / GUI. You can launch a website in a headless browser, query its DOM, run JS, provide user input, and take screenshots. Using headless browsers for web scraping is the most resource-intensive and unreliable way, but sometimes unavoidable. Keep in mind that when you open a website, its "complete" state requires a couple of seconds of loading, popups might show up, and media loads in, which makes the website's content jump around and change at the last second. For this reason I would advise you to avoid more general solutions like Selenium, and go for Puppeteer if possible. Selenium is for broader applications, automating on multiple screen resolutions and browsers, while Puppeteer is a Chromium-based, pre-packaged testing tool.
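
A minimal Puppeteer sketch, assuming the page only renders its content with JavaScript; the URL and the .listing selector are hypothetical:

```js
// Render a client-side page in headless Chromium, wait for content, extract it.
// The URL and the .listing selector are hypothetical.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Wait until network activity settles, so late-loading content is present.
  await page.goto('https://example.com/feed', { waitUntil: 'networkidle2' });
  await page.waitForSelector('.listing');

  const listings = await page.$$eval('.listing', elements =>
    elements.map(element => element.textContent.trim())
  );

  console.log(listings);
  await browser.close();
})();
```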

Languages

Choose the language you like the most, because you will tune your scraper a lot. Reliability is important, just like with regular application development, except in this case breaking is inevitable. Sooner or later you'll have to follow changes made by the target website's developers, so error handling and health checking are obligatory.

Python is the mainstream language for web scraping due to its strong presence in big data and analytics. The most popular HTTP client in the Python world is Requests, and for DOM parsing, use Beautiful Soup. For headless browsing in Python, you have to rely on Selenium.

Node.js is my preferred technology. It has a large pool of packages for HTTP requests, as well as for DOM parsing. The HTTP client you should use is Axios, and jsdom is excellent at making the DOM accessible with familiar methods like document.querySelector. Other packages can be used for these tasks as well, and always abstract away these low-level tasks, because you might have to switch technology to avoid detection (I'll explain these scenarios later).
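
A sketch of what such an abstraction can look like: the transport is hidden behind one function, so it can later be swapped from Axios to a headless browser without touching the parsing code. The function name, URL, and selector are hypothetical:

```js
// Hide the transport behind one function so it can be swapped later
// (Axios today, a headless browser tomorrow) without touching the parsing code.
const axios = require('axios');
const { JSDOM } = require('jsdom');

// The only place that knows how pages are fetched.
async function fetchDocument(url) {
  const { data: html } = await axios.get(url);
  return new JSDOM(html).window.document;
}

async function scrape() {
  const document = await fetchDocument('https://example.com/products');
  return [...document.querySelectorAll('.product-title')]
    .map(element => element.textContent.trim());
}

scrape().then(console.log).catch(console.error);
```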

For prototyping, shell scripts might be enough. In Devtools > Network tab > right click on a request > Copy Value > Copy as cURL, and you have a curl command to request that resource.


Proxy & VPN & TOR

Masking your real IP is a fundamental part of web scraper development: getting an IP off of blocklists is not easy, and scraping from your own IP can give away your business goals even before your operation starts.

For development, I prefer to use a system-wide VPN. VPNs provide a great number of usually good-quality IP addresses you can use to disguise your activity and avoid blocking. For a large number of requests, in production, use proxies. From a web scraping standpoint, the difference is performance: switching IP addresses is quicker and the latency is lower with proxies, but good-quality proxy services are expensive.

Free proxies can be used, but look out for malicious actors! This repository is the best source of free proxies; just be prepared to detect broken ones and switch when necessary. Don't forget that you can download this list from the URL:

https://raw.githubusercontent.com/saschazesiger/Free-Proxies/master/proxies/working.txt

with your HTTP client.
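
A sketch of downloading the list and routing a request through one entry with Axios; it assumes the list contains one host:port pair per line, and real code needs retry logic for dead proxies:

```js
// Download the free proxy list and route a request through a random entry.
// Assumes the list contains one "host:port" entry per line.
const axios = require('axios');

const PROXY_LIST_URL =
  'https://raw.githubusercontent.com/saschazesiger/Free-Proxies/master/proxies/working.txt';

async function getWithRandomProxy(url) {
  const { data: list } = await axios.get(PROXY_LIST_URL);
  const proxies = list.split('\n').map(line => line.trim()).filter(Boolean);

  const [host, port] = proxies[Math.floor(Math.random() * proxies.length)].split(':');

  // Expect broken proxies: wrap this call in retry logic in a real scraper.
  return axios.get(url, {
    proxy: { protocol: 'http', host, port: Number(port) },
    timeout: 10000,
  });
}

getWithRandomProxy('https://example.com').then(r => console.log(r.status)).catch(console.error);
```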

TOR is another free solution for IP masking, and the least productive one. TOR traffic is always picked up by reverse proxies like Cloudflare and challenged with a captcha. Its use is justified only in rare circumstances.

Advanced

Legal ground

In general, web scraping is legal as long as it is done for legitimate purposes. However, web scraping can potentially be illegal if it is done to gain unauthorized access to someone else's data or to steal sensitive information. Additionally, some websites have terms of service that explicitly prohibit web scraping, so it is important to check a website's terms of service before starting to scrape it. It is also advisable to be mindful of the impact that your scraping activities may have on the website you are accessing, as excessive or improperly done web scraping can put a strain on the site's servers and bandwidth.

In summary, web scraping is generally legal as long as it is done for legitimate purposes, but it is important to consider any potential legal and ethical implications before starting to scrape a website. I can't give you legal advice; consult with a lawyer before starting your web scraping activity.

HTTP headers

HTTP headers are used to provide additional information about HTTP requests and responses. Devtools' Network tab shows all sorts of information regarding HTTP headers, and it's often necessary to research and mimic them in your scraper.

There are standard and non-standard HTTP headers; for this reason, inspecting HTTP headers is part of the reverse engineering process. Unfortunately, non-standard HTTP headers might be used by developers for pagination and filtering result sets.

Rate limits

Rate limit HTTP headers are used to limit the rate at which a client can make requests to a server. They are typically used to protect servers from being overwhelmed by too many requests, and to ensure that all clients have fair and reasonable access to the server's resources.

There are several different headers that can be used to implement rate limiting. Some commonly used headers include:

  • X-Rate-Limit: This header specifies the maximum number of requests that a client is allowed to make within a specified time period.
  • X-Rate-Limit-Remaining: This header indicates the number of requests that a client has remaining before it reaches the rate limit.
  • X-Rate-Limit-Reset: This header specifies the time (in seconds) until the client's rate limit will be reset.

When you make too many requests to a server that is rate limited, the server will return a 429 Too Many Requests HTTP status code along with the rate limit headers. The client can then use the headers to determine how many requests it has remaining and when its rate limit will be reset, and can adjust its behaviour accordingly. Honoring this rate limit is crucial to keep your operation "white hat".
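
A sketch of honouring these headers with Axios; the exact header names and the retry policy vary per site, so treat them as assumptions:

```js
// Back off when the server signals its rate limit.
// Header names differ between sites; the X-Rate-Limit-* variants below are assumptions.
const axios = require('axios');

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function politeGet(url) {
  try {
    const response = await axios.get(url);

    // Slow down proactively when close to the limit.
    const remaining = Number(response.headers['x-rate-limit-remaining']);
    if (!Number.isNaN(remaining) && remaining < 5) {
      await sleep(Number(response.headers['x-rate-limit-reset'] || 1) * 1000);
    }

    return response.data;
  } catch (error) {
    if (error.response && error.response.status === 429) {
      // Wait the advertised time, then try again.
      const reset = Number(error.response.headers['x-rate-limit-reset'] || 60);
      await sleep(reset * 1000);
      return politeGet(url);
    }
    throw error;
  }
}

politeGet('https://example.com').then(console.log).catch(console.error);
```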

From HTTP header

Other steps can be taken to stay on legal ground; one of them is using the From HTTP header. This header should contain contact information (an email address) which can be used to reach you in case your actions provoke unwanted consequences, like overwhelming the server with too many requests.
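
Setting it with Axios is a one-liner; the email address and URL below are placeholders:

```js
// Identify yourself so the site's operators can contact you instead of blocking you.
// The email address and URL are placeholders.
const axios = require('axios');

axios.get('https://example.com/products', {
  headers: { From: 'scraper-operator@example.com' },
});
```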

AUP - Acceptable Use Policy

An AUP is a set of guidelines that outlines the permitted and prohibited uses of a website or other online service. In the context of web scraping, an AUP may specify whether or not web scraping is allowed, and if so, under what conditions.

There are many different approaches a website owner may take when it comes to web scraping. Some websites may allow web scraping under certain conditions, such as requiring that the scraper identify itself or limiting the rate at which it can make requests. Other websites may prohibit web scraping entirely, or may allow it only with the express permission of the website owner.

It is important for web scrapers to respect the AUP of the websites they are scraping, as failure to do so could result in legal action being taken against the scraper's developer. If a website's AUP prohibits web scraping, it is generally best to avoid scraping the website altogether.

Cookies

A cookie is a special kind of HTTP header used to store and share data long-term between the client and server. In the first HTTP response, the server might provide a Set-Cookie HTTP header. Its value is stored by the browser and later sent back in the Cookie HTTP request header. This data can be changed and read by both the server and the client, and is most frequently used for tracking and authorization. Storing cookies and sending them back to the server is often necessary to make your scraper work.
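
A minimal sketch of carrying the session over with Axios; the URLs are placeholders, and a real scraper would use a proper cookie jar library:

```js
// Capture Set-Cookie from the first response and send it back on later requests.
// The URLs are placeholders; a real scraper would use a proper cookie jar.
const axios = require('axios');

async function withSession() {
  const first = await axios.get('https://example.com/login-page');

  // Set-Cookie may appear multiple times; keep only the name=value pairs.
  const cookies = (first.headers['set-cookie'] || [])
    .map(cookie => cookie.split(';')[0])
    .join('; ');

  return axios.get('https://example.com/members-only', {
    headers: { Cookie: cookies },
  });
}

withSession().then(response => console.log(response.status)).catch(console.error);
```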

CSRF

Cross-Site Request Forgery (CSRF) is an attack that forces an end user to execute unwanted actions on a web application in which they’re currently authenticated. With a little help of social engineering (such as sending a link via email or chat), an attacker may trick the users of a web application into executing actions of the attacker’s choosing. If the victim is a normal user, a successful CSRF attack can force the user to perform state changing requests like transferring funds, changing their email address, and so forth. If the victim is an administrative account, CSRF can compromise the entire web application. (Source: OWASP)

Protecting users from this kind of attack is done with a CSRF token. This token can be found in forms as a hidden input, and frequently in the <head> as a <meta> tag. To evaluate whether your target is using a CSRF token, check the HTTP headers and the body of the given request.
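
A sketch of reading the token before submitting a request; the URL, the meta tag name, and the X-CSRF-Token header name are site-specific assumptions:

```js
// Read the CSRF token from the page, then include it in the follow-up request.
// The URL, the meta tag name, and the header name are site-specific assumptions.
const axios = require('axios');
const { JSDOM } = require('jsdom');

async function searchWithCsrf(query) {
  const { data: html, headers } = await axios.get('https://example.com/search');
  const { document } = new JSDOM(html).window;

  const tokenTag = document.querySelector('meta[name="csrf-token"]');
  if (!tokenTag) throw new Error('CSRF token not found');

  return axios.post(
    'https://example.com/search',
    { query },
    {
      headers: {
        'X-CSRF-Token': tokenTag.getAttribute('content'),
        // The session cookie usually has to accompany the token.
        Cookie: (headers['set-cookie'] || []).map(c => c.split(';')[0]).join('; '),
      },
    }
  );
}
```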


User agent

The User-Agent header contains information about your operating system and browser version, for example:

Mozilla/5.0 (iPhone13,2; U; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) Version/10.0 Mobile/15E148 Safari/602.1

HTTP clients are quite honest about this, but faking it is as easy as overwriting the User-Agent header's value.
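
Overriding it with Axios; the URL is a placeholder, and the User-Agent string is just one common desktop Chrome example:

```js
// Overwrite the default User-Agent so the request looks like a regular browser.
// The URL is a placeholder; pick a User-Agent string that matches real traffic.
const axios = require('axios');

axios.get('https://example.com', {
  headers: {
    'User-Agent':
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
      '(KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
  },
});
```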

Reverse proxy

A reverse proxy is a type of server that sits between a client and a server, forwarding requests from the client to the server and returning the server's responses back to the client. Reverse proxies are often used to improve the performance, security, and availability of a server or a group of servers. They hide the real IP address of the backend, and can challenge any suspicious traffic.

Real IP behind reverse proxy

Finding the real IP address of the server might be impossible, because direct access might be blocked or forbidden by the terms of service, but it helps a lot in avoiding captchas. The first thing to check is whether shodan.io is indexing your target. A DNS history lookup is also a good option: there might have been a time when the target's domain name pointed directly to the origin server.

TLS fingerprinting

TLS fingerprinting is typically done by analyzing various aspects of the SSL/TLS connections made by a device or software, such as the supported cipher suites, the order in which they are presented, the length of the SSL/TLS session IDs, and the format of the SSL/TLS certificates. By analyzing these characteristics, it is possible to create a unique fingerprint for each device or piece of software. TLS fingerprints are paired with the User-Agent header, and the reverse proxy challenges the request when it finds a mismatch. You can use a headless browser to avoid this kind of detection.

Want more?

I'll release more content regarding web scraping as I gain more experience. To be notified, subscribe to this Gist.
