I have a number of monthly manual tasks that I have not been able to automate so far.
One of them is getting PDFs from login-protected websites, saving them in specific folders according to naming conventions (renaming etc.) and uploading them to my Nextcloud.
The script below is for my electricity provider's PDFs. They sit behind a simple login form (user & pw), and only the past 6 months are available. I always forget to check often enough to download them all.
Two key takeaways:
- Use the selenium/standalone-chrome Docker image
  - mount the outside download folder directly onto the default download folder of the container's seluser user (/home/seluser/Downloads/), so we don't have to modify Chrome's default PDF download location:
    docker run -d -p 127.0.0.1:4448:4444 \
      --volume $(pwd)/download:/home/seluser/Downloads/ \
      selenium/standalone-chrome
- Use wait times after selenium's .get() so the session/cookies have time to be set
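As a sketch of the second takeaway, the fixed wait could also be a short polling loop that returns as soon as the session cookie exists. The helper name wait_for_cookie and the cookie name PHPSESSID are assumptions for illustration (check the real cookie name in the browser's dev tools); get_cookie() is the regular WebDriver call, which returns None while the cookie is missing.

```python
import time


def wait_for_cookie(driver, name, timeout=10.0, poll=0.5):
    """Poll driver.get_cookie(name) until a cookie shows up or time runs out.

    Returns the cookie dict, or None if the cookie never appeared.
    """
    deadline = time.monotonic() + timeout
    while True:
        cookie = driver.get_cookie(name)  # None while the cookie is not set
        if cookie is not None:
            return cookie
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)
```

With the driver from the script this could be called as wait_for_cookie(driver, "PHPSESSID") right after driver.get(...). A random sleep can still be kept on top if the delay is meant to look less like a bot.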
This is my code. I think it can be adapted to comparably simple login pages; just replace the find_element parts:
import sys
import logging
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from random import randint
from time import sleep
from pathlib import Path
def list_downloads(driver):
    # Returns the file URLs once every download is COMPLETE. While a download
    # is still running the script returns nothing (None in Python), so
    # WebDriverWait keeps polling.
    if not driver.current_url.startswith("chrome://downloads"):
        driver.get("chrome://downloads/")
    return driver.execute_script("""
        var items = document.querySelector('downloads-manager')
            .shadowRoot.getElementById('downloadsList').items;
        if (items.every(e => e.state === "COMPLETE"))
            return items.map(e => e.fileUrl || e.file_url);
        """)
# enable debug logging
root = logging.getLogger()
root.setLevel(logging.DEBUG)
handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.DEBUG)
formatter = logging.Formatter(
    '%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
root.addHandler(handler)
logging.info('Creating folder.')
out_dir = Path.cwd() / "download"
out_dir.mkdir(exist_ok=True)
prefs = {'profile.default_content_settings.popups': 0,
         'download.prompt_for_download': False,
         'download.directory_upgrade': True,
         'plugins.always_open_pdf_externally': True}
options = webdriver.ChromeOptions()
options.add_experimental_option('prefs', prefs)
options.add_argument('start-maximized')
options.add_argument('disable-infobars')
# see https://stackoverflow.com/a/73840130/4556479
options.add_argument("--headless=new")
options.add_argument('--disable-gpu')
options.add_argument('--disable-extensions')
logging.info('Connecting to Remote Chrome')
driver = webdriver.Remote(
    "http://127.0.0.1:4448/wd/hub", options=options)
logging.info('Opening login page')
# Time to wait for element's presence
driver.implicitly_wait(5)
driver.get('https://sample.com/login.php')
# Sleep a random number of seconds
sleep(randint(2, 5))
# Click 'Accept cookies' button
logging.info('Accepting cookies..')
accept_cookies_button = driver.find_element(By.ID, 'cookieNotifyButton')
accept_cookies_button.click()
sleep(randint(2, 5))
logging.info('Sending username and password.')
username_input = driver.find_element(By.CSS_SELECTOR, 'input[name="pin"]')
password_input = driver.find_element(By.CSS_SELECTOR, 'input[name="passwd"]')
username_input.send_keys("xyz")
password_input.send_keys("xyz")
sleep(randint(2, 5))
logging.info('Logging in..')
login_button = driver.find_element(By.NAME, 'login0')
login_button.click()
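# WebDriverWait.until() raises TimeoutException (not NoSuchElementException)
# when the element never shows up, so catch that as well.
from selenium.common.exceptions import TimeoutException

try:
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'lh2'))
    )
except (NoSuchElementException, TimeoutException):
    logging.error('Expected element did not appear after login, giving up.')
    driver.quit()
    sys.exit(1)
# this seems necessary for the PDF to load
# (find_element_by_link_text was removed in Selenium 4)
driver.find_element(By.LINK_TEXT, 'Abruf Rechnungsdaten').click()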
logging.info('Retrieving PDF')
driver.get('https://sample.com/get_pdf.php?value=-5')
# wait for download to finish
paths = WebDriverWait(driver, 12, 3).until(list_downloads)
print(paths)
driver.quit()
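The remaining manual steps from the intro (renaming and uploading to Nextcloud) could be sketched roughly like this. It assumes that the file:// URLs printed above point into /home/seluser/Downloads/ and therefore map onto the mounted download folder by file name. The naming scheme, the function names and the app-password login are made up for illustration; remote.php/dav/files/<user>/ is Nextcloud's WebDAV path for a user's files.

```python
import base64
import urllib.request
from pathlib import Path, PurePosixPath
from urllib.parse import quote, unquote, urlparse


def host_path_for(file_url, host_dir):
    """Map a file:// URL from inside the container to the mounted host folder."""
    container_path = PurePosixPath(unquote(urlparse(file_url).path))
    return Path(host_dir) / container_path.name


def target_name(day, provider="electricity"):
    """Hypothetical naming convention: YYYY-MM_<provider>_invoice.pdf."""
    return f"{day:%Y-%m}_{provider}_invoice.pdf"


def rename_download(file_url, host_dir, day):
    """Rename the downloaded file in place and return its new path."""
    source = host_path_for(file_url, host_dir)
    target = source.with_name(target_name(day))
    source.rename(target)
    return target


def nextcloud_url(base_url, user, remote_path):
    """Build the WebDAV URL for a file in the user's Nextcloud files."""
    return (f"{base_url.rstrip('/')}/remote.php/dav/files/"
            f"{quote(user)}/{quote(remote_path.lstrip('/'))}")


def upload(local_path, url, user, app_password):
    """PUT the file to Nextcloud via WebDAV and return the HTTP status code."""
    token = base64.b64encode(f"{user}:{app_password}".encode()).decode()
    request = urllib.request.Request(
        url,
        data=Path(local_path).read_bytes(),
        method="PUT",
        headers={"Authorization": f"Basic {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

For each URL in paths, rename_download(url, out_dir, date.today()) followed by upload(...) would then replace the manual steps. Files written by the container may belong to a different user on the host, so the rename can need adjusted permissions.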
Sources:
- list_downloads() comes from a Stack Overflow answer
- some of the general steps are adapted from shellhacks.com