sarthakpranesh/hcpProject.md

Created February 16, 2021 09:26


Parallel Soup


High Performance Library to parallelize BeautifulSoup

A library that wraps BeautifulSoup to provide multithreaded scraping, reducing the total time the scraping process takes. The library should implement the following effectively (this list may be extended in the future):

Parallelization
Should have a generic interface that maps to BeautifulSoup
All parts of the library should be documented heavily
All parts of the library should have unit tests that verify their functionality
Showcase written examples for different sorts of scraping
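
The parallelization goal above could be approached roughly as follows. This is only a minimal sketch under stated assumptions, not the library's actual design: `scrape_parallel` and `soup_title` are hypothetical names, and `fetch` stands for whatever function the caller uses to download a page. It relies on the standard `concurrent.futures.ThreadPoolExecutor`, which tends to help here because most scraping time is spent waiting on the network rather than parsing.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def scrape_parallel(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    parse: Callable[[str], T],
    max_workers: int = 8,
) -> List[T]:
    """Fetch and parse every URL on a thread pool (hypothetical helper).

    Results come back in the same order as `urls`.
    """

    def work(url: str) -> T:
        # Each worker thread downloads one page, then parses it.
        return parse(fetch(url))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order and re-raises worker exceptions.
        return list(pool.map(work, urls))


def soup_title(html: str) -> str:
    """Example parse step: return the page <title> text, or "" if absent."""
    # Imported here so scrape_parallel itself stays usable without bs4.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else ""
```

With a caller-supplied `fetch` that returns the HTML for a URL, `scrape_parallel(urls, fetch, soup_title)` would return one title per URL. Because threads in CPython share the interpreter lock, the speedup in a sketch like this comes from overlapping network waits; the BeautifulSoup parsing step itself does not run in parallel.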

Some Resources


sarthakpranesh commented Feb 22, 2021

Some papers:
