Last active August 8, 2024 15:21
Written for a question about getting PDF attachments in Gmail as text. I got a little carried away; this does much more than asked.

Google Apps Script pdfToText Utility

This is a helper function that will convert a given PDF file blob into text, as well as offering options to save the original PDF, intermediate Google Doc, and/or final plain text files. Additionally, the language used for Optical Character Recognition (OCR) may be specified, defaulting to 'en' (English).

Note: Updated 12 May 2015 due to deprecation of DocsList. Thanks to Bruce McPherson for the getDriveFolderFromPath() utility.

    // Start with a Blob object
    var blob = gmailAttachment.getAs(MimeType.PDF);

    // fileId will be the ID of a saved text file (default behavior):
    var fileId = pdfToText( blob );

    // filetext will contain text from pdf file, no residual files are saved:
    var filetext = pdfToText( blob, {keepTextfile: false} );

    // we can save other converted file types, too:
    var options = {
       keepPdf : true,            // Keep a copy of the original PDF file.
       keepGdoc : true,           // Keep a copy of the OCR Google Doc file.
       keepTextfile : true,       // Keep a copy of the text file. (default)
       path : "attachments/today" // Folder path to store file(s) in.
    };
    // With keepTextfile true and textResult unset, the text file's ID is returned:
    fileId = pdfToText( blob, options );
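
The keepTextfile and textResult options together decide whether pdfToText() hands back text or a file ID. The following is only a rough sketch of that rule, based on the option descriptions in this gist; describeResult is a made-up helper for illustration, not part of the utility, and it never touches Drive.

```javascript
// Hypothetical helper, not part of the gist: mirrors the documented rule
// for what pdfToText() returns, without calling any Drive service.
function describeResult(options) {
  options = options || {};
  // keepTextfile defaults to true when the caller does not set it.
  var keepTextfile = options.hasOwnProperty('keepTextfile') ? options.keepTextfile : true;
  // Text is returned when no text file is kept, or when textResult asks for it.
  if (!keepTextfile || options.textResult) {
    return 'text';
  }
  return 'fileId';
}

console.log(describeResult());                        // fileId
console.log(describeResult({ keepTextfile: false })); // text
console.log(describeResult({ textResult: true }));    // text
```

In other words, a caller who wants both a saved text file and the text itself would pass textResult: true alongside the default keepTextfile.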

/**
 * Convert pdf file (blob) to a text file on Drive, using built-in OCR.
 * By default, the text file will be placed in the root folder, with the same
 * name as source pdf (but extension 'txt'). Options:
 *   keepPdf (boolean, default false)     Keep a copy of the original PDF file.
 *   keepGdoc (boolean, default false)    Keep a copy of the OCR Google Doc file.
 *   keepTextfile (boolean, default true) Keep a copy of the text file.
 *   path (string, default blank)         Folder path to store file(s) in.
 *   ocrLanguage (ISO 639-1 code)         Default 'en'.
 *   textResult (boolean, default false)  If true and keepTextfile true, return
 *                                        string of text content. If keepTextfile
 *                                        is false, text content is returned without
 *                                        regard to this option. Otherwise, return
 *                                        id of textfile.
 *
 * @param {blob}   pdfFile  Blob containing pdf file
 * @param {object} options  (Optional) Object specifying handling details
 *
 * @returns {string}        id of text file (default) or text content
 */
function pdfToText ( pdfFile, options ) {
  // Ensure Advanced Drive Service is enabled
  try {
    Drive.Files.list();
  }
  catch (e) {
    throw new Error( "To use pdfToText(), first enable 'Drive API' in Resources > Advanced Google Services." );
  }

  // Set default options
  options = options || {};
  options.keepTextfile = options.hasOwnProperty("keepTextfile") ? options.keepTextfile : true;

  // Prepare resource object for file creation
  var parents = [];
  if (options.path) {
    parents.push( getDriveFolderFromPath (options.path) );
  }
  var pdfName = pdfFile.getName();
  var resource = {
    title: pdfName,
    mimeType: pdfFile.getContentType(),
    parents: parents
  };

  // Save PDF to Drive, if requested
  if (options.keepPdf) {
    var file = Drive.Files.insert(resource, pdfFile);
  }

  // Save PDF as GDOC
  resource.title = pdfName.replace(/pdf$/, 'gdoc');
  var insertOpts = {
    ocr: true,
    ocrLanguage: options.ocrLanguage || 'en'
  };
  var gdocFile = Drive.Files.insert(resource, pdfFile, insertOpts);

  // Get text from GDOC
  var gdocDoc = DocumentApp.openById(gdocFile.id);
  var text = gdocDoc.getBody().getText();

  // We're done using the Gdoc. Unless requested to keepGdoc, delete it.
  if (!options.keepGdoc) {
    Drive.Files.remove(gdocFile.id);
  }

  // Save text file, if requested
  if (options.keepTextfile) {
    resource.title = pdfName.replace(/pdf$/, 'txt');
    resource.mimeType = MimeType.PLAIN_TEXT;

    var textBlob = Utilities.newBlob(text, MimeType.PLAIN_TEXT, resource.title);
    var textFile = Drive.Files.insert(resource, textBlob);
  }

  // Return result of conversion
  if (!options.keepTextfile || options.textResult) {
    return text;
  }
  else {
    return textFile.id;
  }
}

// Helper utility from
function getDriveFolderFromPath (path) {
  // Walk the path one folder name at a time, starting at the root folder.
  return (path || "/").split("/").reduce ( function(prev,current) {
    if (prev && current) {
      var fldrs = prev.getFoldersByName(current);
      return fldrs.hasNext() ? fldrs.next() : null;
    }
    else {
      return current ? null : prev;
    }
  }, DriveApp.getRootFolder());
}
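
To see how the reduce-based path walk in getDriveFolderFromPath() behaves, here is a small sketch that runs outside Apps Script. The fakeFolder objects are an assumption made for the demo: they only imitate getFoldersByName() and the iterator's hasNext()/next(). walkPath is the same loop with the starting folder passed in instead of taken from DriveApp.

```javascript
// Minimal in-memory stand-in for a Drive folder (an assumption for this demo);
// it only mimics getFoldersByName() and the iterator's hasNext()/next().
function fakeFolder(name, children) {
  children = children || [];
  return {
    name: name,
    getFoldersByName: function (wanted) {
      var matches = children.filter(function (c) { return c.name === wanted; });
      var i = 0;
      return {
        hasNext: function () { return i < matches.length; },
        next: function () { return matches[i++]; }
      };
    }
  };
}

// Same reduce-based walk as getDriveFolderFromPath(), but the starting
// folder is passed in so it can run without DriveApp.
function walkPath(root, path) {
  return (path || '/').split('/').reduce(function (prev, current) {
    if (prev && current) {
      var fldrs = prev.getFoldersByName(current);
      return fldrs.hasNext() ? fldrs.next() : null;
    }
    return current ? null : prev;
  }, root);
}

var root = fakeFolder('root', [
  fakeFolder('attachments', [fakeFolder('today')])
]);

console.log(walkPath(root, 'attachments/today').name); // today
console.log(walkPath(root, 'attachments/missing'));    // null
console.log(walkPath(root, '').name);                  // root
```

A null result means some folder name along the path was not found, so a caller may want to check for it before using the result as a parent.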

ChaIcuWo commented Sep 4, 2015

Generally this works really well for my purposes, but I've had a repeating issue with some PDFs where the text returned is not from the entire document. In the most recent instance this is a 19 page PDF and only the first 10 pages are returned. This repeats even if I randomly re-arrange the pages in the document. Is this a known issue/limitation?

EDIT: I found this immediately after posting this... Seems we're stuck with this

Copy link

I've been using this for a few months but yesterday it just stopped working completely, is anyone else having this problem?

It fails on line
var gdocFile = Drive.Files.insert(resource, pdfFile, insertOpts);

Just says internal error. I've also noticed that I can't right click a pdf in Drive and go to "open with > google docs" as that just errors too. I hope this gets fixed as I use this a lot.

Copy link

Same issue here.... Also internal error on the same line.

Copy link

I logged it as an issue with Google. As far as I can see, it is a bug with the Google dudes.

Copy link

Hi mogsdad,

I currently use a script to download emailed pdf docs and save them to drive. Would you be interested in combining the two scripts or adding on to either? I am interested in engaging someone to code various scripts.


Copy link

@dan23njguy I currently also use another script to save attachments from Gmail to Drive, and I was looking for a way to search text inside PDFs when I found this. Would you be interested in working together?

2wice2 commented Mar 19, 2018

Is it possible to rename the PDF based on its content?
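
One way this might be approached, offered only as a sketch and not as something the gist does: take the text returned by pdfToText() and derive a file name from its first non-empty line. nameFromText is a hypothetical helper, and the 60-character cap and the set of replaced characters are arbitrary choices for the example.

```javascript
// Hypothetical helper: turns the first non-empty line of extracted text
// into a file name, falling back to a default when the text is blank.
function nameFromText(text, fallback) {
  var lines = String(text || '').split(/\r?\n/);
  var first = '';
  for (var i = 0; i < lines.length; i++) {
    if (lines[i].trim()) {
      first = lines[i].trim();
      break;
    }
  }
  // Replace characters that are awkward in file names and cap the length.
  var safe = first.replace(/[\\\/:*?"<>|]+/g, '_').slice(0, 60);
  return (safe || fallback || 'untitled') + '.pdf';
}

console.log(nameFromText('\nInvoice 2018/03: ACME\nTotal due...')); // Invoice 2018_03_ ACME.pdf
console.log(nameFromText('', 'scan'));                               // scan.pdf
```

In Apps Script, the resulting string could then be passed to setName() on the saved Drive file.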

Armsp0 commented May 24, 2018

I tried to debug/run this code and I get the error "TypeError: Cannot call method "getName" of undefined etc".
What is wrong here? Obviously I am a novice...
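
One possible cause, offered as a guess since the comment does not show how the function was run: if pdfToText() is started directly from the script editor, no blob is passed in, so pdfFile is undefined and pdfFile.getName() fails. A minimal sketch of a guard that fails with a clearer message follows; requireBlob and fakeBlob are hypothetical names used only for this illustration.

```javascript
// Hypothetical guard, not part of the gist: checks that the argument looks
// like a Blob (has getName and getContentType) before pdfToText() uses it.
function requireBlob(pdfFile) {
  if (!pdfFile ||
      typeof pdfFile.getName !== 'function' ||
      typeof pdfFile.getContentType !== 'function') {
    throw new Error('pdfToText() needs a PDF blob as its first argument; ' +
                    'call it from another function that supplies one.');
  }
  return pdfFile;
}

// A plain object standing in for an Apps Script Blob in this demo.
var fakeBlob = {
  getName: function () { return 'invoice.pdf'; },
  getContentType: function () { return 'application/pdf'; }
};

console.log(requireBlob(fakeBlob).getName()); // invoice.pdf

try {
  requireBlob(undefined);
} catch (e) {
  console.log(e.message);
}
```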

vladox commented Apr 10, 2021

I'd recommend using pdftotext from the poppler package.

Copy link

Hi David, by any chance, is there a license for this gist? I would like to use it in one of my projects, but I want to make sure that I am not infringing on your copyright.

Copy link

Hi, my files are being saved in "My Drive" rather than in my desired folder, even though I put the right address in path : "Attachments/Test", which exists. Can anyone help me solve this issue?

Copy link

Hi, it is actually referring to the main drive (Drive.files); you need to replace it with "DriveApp.getFolderById('string_id_of_my_folder');"

You may visit us for more at help

Copy link

It's giving error, "TypeError: DriveApp.getFolderById(...).insert is not a function".

thokoe commented Oct 20, 2022

Hi, is it possible to run the script without having to save the Google Doc to Drive and then delete it?

appscriptexpert commented Oct 20, 2022 via email

dahse89 commented Apr 13, 2023

The script wasn't working for me, but I found this:

/*
 * Convert PDF file to text
 * @param {string} fileId - The Google Drive ID of the PDF
 * @param {string} language - The language of the PDF text to use for OCR
 * return {string} - The extracted text of the PDF file
 */

const convertPDFToText = (fileId, language) => {
  fileId = fileId || '18FaqtRcgCozTi0IyQFQbIvdgqaO_UpjW'; // Sample PDF file
  language = language || 'en'; // English

  // Read the PDF file in Google Drive
  const pdfDocument = DriveApp.getFileById(fileId);

  // Use OCR to convert PDF to a temporary Google Document
  // Restrict the response to include file Id and Title fields only
  const { id, title } = Drive.Files.insert(
    {
      title: pdfDocument.getName().replace(/\.pdf$/, ''),
      mimeType: pdfDocument.getMimeType() || 'application/pdf',
    },
    pdfDocument.getBlob(),
    {
      ocr: true,
      ocrLanguage: language,
      fields: 'id,title',
    }
  );

  // Use the Document API to extract text from the Google Document
  const textContent = DocumentApp.openById(id).getBody().getText();

  // Delete the temporary Google Document since it is no longer needed
  DriveApp.getFileById(id).setTrashed(true);

  // (optional) Save the text content to another text file in Google Drive
  const textFile = DriveApp.createFile(`${title}.txt`, textContent, 'text/plain');
  return textContent;
};

Copy link

How about getting images? When I add this line to the script:

const ImgContent = DocumentApp.openById(id).getBody().getImage();

I cannot get everything from the PDF file: there are 3 images, but only 2 are detected.

Copy link

The way this script calls Drive.Files.insert is outdated: the API now needs a ParentReference in the parents field, and without one the request always uploads to the root folder.

The if (options.path) block must be replaced by:

  if (options.path) {
    const folder = getDriveFolderFromPath (options.path);
    if (folder) {
      const parentReference = Drive.newParentReference();
      parentReference.id = folder.getId();
      parents.push(parentReference);
    }
  }
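
To illustrate the shape this comment is pointing at: in the Drive API v2, which Drive.Files.insert belongs to, parents is a list of objects carrying an id, while v3 (Drive.Files.create) takes a list of plain ID strings. A small sketch follows; buildParents and the 'FOLDER_ID' placeholder are invented for the example.

```javascript
// Hypothetical helper: builds the `parents` value in the shape each
// Drive API version expects. Returns [] when no folder ID is given.
function buildParents(folderId, apiVersion) {
  if (!folderId) {
    return [];
  }
  // v2: array of parent references ({ id: ... }); v3: array of ID strings.
  return apiVersion === 2 ? [{ id: folderId }] : [folderId];
}

console.log(JSON.stringify(buildParents('FOLDER_ID', 2))); // [{"id":"FOLDER_ID"}]
console.log(JSON.stringify(buildParents('FOLDER_ID', 3))); // ["FOLDER_ID"]
console.log(JSON.stringify(buildParents(null, 2)));        // []
```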

Copy link

I'm using this function that works well:

/**
 * @param {string} fileId
 * @param {string} parentFolderId
 * @returns {string} pdfContent
 */
function extractTextFromPDF(fileId, parentFolderId) {

  const destFolder = Drive.Files.get(parentFolderId, { "supportsAllDrives": true });
  const newFile = {
    "fileId": fileId,
    "parents": [
      destFolder.id
    ]
  };
  const args = {
    "resource": {
      "parents": [
        destFolder.id
      ],
      "name": "temp",
      "mimeType": "application/vnd.google-apps.document",
    },
    "supportsAllDrives": true
  };

  // Copying the PDF with a Google Docs MIME type converts it to a Doc
  const newTargetDoc = Drive.Files.copy(newFile, fileId, args);
  const newTargetFile = DocumentApp.openById(newTargetDoc.getId());
  const pdfContent = newTargetFile.getBody().getText();

  return pdfContent;
}

Copy link

I got the "insert is not a function" error and fixed it with this code. It can also read multiple PDF files.

const convertPDFToText = (pdfDocument) => {
  try {

    // Use OCR to convert PDF to a temporary Google Document
    const fileMetadata = {
      name: pdfDocument.getName().replace(/\.pdf$/, ''),
      mimeType: 'application/vnd.google-apps.document' // Ensuring the target MIME type is Google Docs
    };

    const media = pdfDocument.getBlob();

    const options = {
      ocr: true,
      ocrLanguage: "en",
      fields: 'id, name'
    };

    const response = Drive.Files.create(fileMetadata, media, options);
    const { id, name } = response;

    // Add a delay to ensure the document is fully processed
    Utilities.sleep(10000); // 10 seconds

    // Verify the document exists and is accessible
    const tempFile = DriveApp.getFileById(id);
    // Ensure the file is a Google Document
    const mimeType = tempFile.getMimeType();
    if (mimeType !== MimeType.GOOGLE_DOCS) {
      throw new Error(`Unexpected MIME type: ${mimeType}`);
    }

    // Use the Document API to extract text from the Google Document
    const doc = DocumentApp.openById(id);
    const body = doc.getBody();

    // Check if the document body is empty
    if (!body || !body.getText()) {
      throw new Error('Document body is empty or not accessible');
    }

    const textContent = body.getText();
    // Delete the temporary Google Document since it is no longer needed
    tempFile.setTrashed(true);
    return textContent;
  } catch (error) {
    Logger.log(`Error: ${error.message}`);
    throw error;
  }
};

const convertPDFsInFolderToText = (folderId) => {
  var folder = DriveApp.getFolderById(folderId);
  var files = folder.getFiles();
  var allTextContent = "";

  while (files.hasNext()) {
    var pdfFile = files.next();
    try {
      var textContent = convertPDFToText(pdfFile);
      allTextContent += textContent;

    } catch (error) {
      // Log the failure and carry on with the remaining files
      Logger.log(`Failed to process file ${pdfFile.getName()}: ${error.message}`);
    }
  }
  return allTextContent;
};

function Run() {
  const folderId = ""; // ID of the Drive folder that holds the PDFs
  Logger.log(convertPDFsInFolderToText(folderId));
}
