Created March 14, 2013 11:48
Download all the pdf files linked in a given webpage.
#!/usr/bin/env python

"""
Download all the pdfs linked on a given webpage

Usage -

    python <script>.py url <path/to/directory>
        url is required
        path is optional. Path needs to be absolute
        will save in the current directory if no path is given
        will save in the current directory if given path does not exist

Requires - requests >= 1.0.4
           beautifulsoup >= 4.0.0

Download and install using

    pip install requests
    pip install beautifulsoup4
"""

__author__ = 'elssar <>'
__license__ = 'MIT'
__version__ = '1.0.0'

from os import getcwd, path
from sys import argv, exit
from urllib.parse import urljoin  # Python 3 location of urljoin

from bs4 import BeautifulSoup as soup
from requests import get


def get_page(base_url):
    # Fetch the page and return its HTML; raise on any non-200 response
    req = get(base_url)
    if req.status_code == 200:
        return req.text
    raise Exception('Error {0}'.format(req.status_code))


def get_all_links(html):
    # Name the parser explicitly so bs4 does not have to guess one
    bs = soup(html, 'html.parser')
    # Keep only anchors that carry an href attribute
    return bs.find_all('a', href=True)


def get_pdf(base_url, base_dir):
    html = get_page(base_url)
    links = get_all_links(html)
    if len(links) == 0:
        raise Exception('No links found on the webpage')
    n_pdfs = 0
    for link in links:
        if link['href'].endswith('.pdf'):
            n_pdfs += 1
            # Resolve relative hrefs against the page URL before downloading
            content = get(urljoin(base_url, link['href']))
            if content.status_code == 200 and content.headers.get('content-type') == 'application/pdf':
                # The link text is used as the file name
                with open(path.join(base_dir, link.text + '.pdf'), 'wb') as pdf:
                    pdf.write(content.content)
    if n_pdfs == 0:
        raise Exception('No pdfs found on the page')
    print("{0} pdfs downloaded and saved in {1}".format(n_pdfs, base_dir))


if __name__ == '__main__':
    if len(argv) not in (2, 3):
        print('Error! Invalid arguments')
        print(__doc__)
        exit(-1)
    arg = ''
    url = argv[1]
    if len(argv) == 3:
        arg = argv[2]
    # Fall back to the current directory when no valid directory is given
    base_dir = arg if path.isdir(arg) else getcwd()
    try:
        get_pdf(url, base_dir)
    except Exception as e:
        print(e)
        exit(-1)
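
The link-filtering step of get_pdf can be tried without any network access. The following is a minimal sketch, assuming Python 3 and using the standard library's html.parser in place of BeautifulSoup; the names PdfLinkCollector and find_pdf_links and the example.com URL are hypothetical and not part of the gist.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkCollector(HTMLParser):
    """Collect absolute URLs of <a href="..."> targets ending in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        # Anchors without an href are skipped instead of raising an error
        if href and href.lower().endswith('.pdf'):
            # Relative links are resolved against the page URL
            self.pdf_urls.append(urljoin(self.base_url, href))


def find_pdf_links(html, base_url):
    collector = PdfLinkCollector(base_url)
    collector.feed(html)
    return collector.pdf_urls


# Hypothetical page fragment: one PDF link, one anchor without href, one HTML link
sample = '<a href="docs/a.pdf">A</a> <a name="top">no href</a> <a href="/b.html">B</a>'
print(find_pdf_links(sample, 'http://example.com/files/index.html'))
```

Only the first anchor is returned, resolved relative to the page URL; the anchor without an href is ignored.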

xorn commented Nov 3, 2016

I get pdf_get() requires exactly 2 args

Me too.

Ompha commented Apr 3, 2017

I tried the following modification which solved the problem "pdf_get() requires exactly 2 args":
Change line 41 to html= get_page(base_url)
Change line 68 to get_pdf(url ,base_dir)

However, the script now gives a new error: "An exception has occurred, use %tb to see the full traceback.
SystemExit: -1".
I traced the error back but cannot find a way to get this working.
Any help would be appreciated. Thanks.
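
The message quoted in this comment is how IPython-style shells report a SystemExit, which is what exit(-1) or sys.exit(-1) raises. So it probably means the script reached one of its exit calls and was not a new crash. A minimal sketch, assuming Python 3; run_cli is a hypothetical stand-in for the script's argument check, not the gist's own code.

```python
import sys


def run_cli(argv):
    """Stand-in for a script's __main__ block that exits on bad arguments."""
    if len(argv) not in (2, 3):
        sys.exit(-1)  # raises SystemExit with code -1
    return 'would download from {0}'.format(argv[1])


try:
    run_cli(['script.py'])  # too few arguments, so the exit path is taken
except SystemExit as exc:
    exit_code = exc.code
    print('SystemExit caught, code =', exit_code)
```

In an interactive session, calling get_pdf(url, base_dir) directly instead of running the whole script avoids the exit call, so the underlying exception is likely to be shown.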

Nice code, worked like a charm! A couple of tweaks and I was able to download all the PDF files.

I also get the "pdf_get() requires exactly 2 args" error whatever I do.

Felipe-UnB commented Nov 15, 2017

If I got it right, is the "2 args" error meant as a way to test whether both arguments, base_url and base_dir, are present in the function call? That seems strange, because Python immediately raises an exception anyway if we run this code without providing the arguments. I made some modifications to the code and it is now running.
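
The "2 args" error is consistent with Python's own argument check rather than a test written into the script. A minimal sketch, assuming the script's entry point called get_pdf with a single argument (an assumption supported by the fix suggested earlier in the thread); the function body here is a placeholder with the same two-parameter signature.

```python
def get_pdf(base_url, base_dir):
    """Placeholder with the same two-parameter signature as the gist's get_pdf."""
    return (base_url, base_dir)


try:
    get_pdf('/tmp')  # only one of the two required arguments is supplied
except TypeError as err:
    message = str(err)
    print(message)

# Supplying both arguments succeeds
print(get_pdf('http://example.com/', '/tmp'))
```

The exact wording of the TypeError differs between Python versions, but in each case it is raised by the interpreter before any of the function body runs.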

The easiest solution to this is to just use the wget command in the terminal.
For example:
wget -r -P ./pdfs -A pdf <url>

@danny311296 Your code returns an error

>>> wget -r -P ./pdfs -A pdf
  File "<stdin>", line 1
    wget -r -P ./pdfs -A pdf
SyntaxError: invalid syntax
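
The SyntaxError above appears because wget is a shell command that was typed at the Python prompt. If it has to be started from Python, one option is the subprocess module. A minimal sketch, assuming wget is installed and on PATH; build_wget_command, run_wget, and the example.com URL are hypothetical.

```python
import shlex
import subprocess


def build_wget_command(url, dest='./pdfs'):
    """Return the argument list for a recursive, PDF-only wget download."""
    return ['wget', '-r', '-P', dest, '-A', 'pdf', url]


def run_wget(url, dest='./pdfs'):
    # Requires wget on PATH; returns wget's exit status
    return subprocess.call(build_wget_command(url, dest))


# Show the equivalent shell command without running it
cmd = build_wget_command('http://example.com/papers/')
print(' '.join(shlex.quote(part) for part in cmd))
```

Passing the command as a list avoids shell quoting problems with URLs that contain characters such as & or ?.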

dannyi96 commented Aug 25, 2018

@Adisain It should work on Ubuntu and most Unix systems.

Maybe try
wget -r -P pdfs -A pdf <url>
instead on other systems.

jQwotos commented Apr 18, 2019

Thanks @danny311296

@danny, the command works amazingly well for the above website. For a different website, however, it throws a "cannot verify certificate" error, so should I try the Python code instead?
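
A certificate verification failure comes from the TLS layer, so a Python script may hit the same check, depending on which CA store each tool uses. A minimal sketch, assuming Python 3 and only the standard library; make_verified_context, fetch, and the ca_bundle path are hypothetical. It keeps verification on and optionally trusts an extra CA file.

```python
import ssl
import urllib.request


def make_verified_context(ca_bundle=None):
    """Build a TLS context that verifies certificates and host names.

    ca_bundle is an optional path to a PEM file of trusted CAs,
    for example one supplied by the site's operator (assumption).
    """
    return ssl.create_default_context(cafile=ca_bundle)


def fetch(url, ca_bundle=None):
    # Raises urllib.error.URLError if the certificate cannot be verified
    context = make_verified_context(ca_bundle)
    with urllib.request.urlopen(url, context=context) as response:
        return response.read()


context = make_verified_context()
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```

If fetch still fails for a site that browsers accept, the server may be omitting an intermediate certificate; supplying a suitable CA file is generally safer than turning verification off.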
