How to Delete the First Page for Thousands of PDF Documents

by Luke Muehlhauser on April 6, 2010 in How-To

This is a companion post to Software for Philosophers.

The familiar JSTOR PDF front page.

If you download thousands of journal articles like I do, you may want to remove the extra page added to the beginning of each article by article repositories like JSTOR.

You can do this in Adobe Acrobat Pro with Document -> Delete Pages, but trust me: you don’t want to do that manually for 500 PDFs.

The solution is automation. First, get each PDF with an unneeded first page into a certain folder called ‘RemoveFirstPage’. This is easiest if you have PDF thumbnails enabled in Windows Explorer, or if you use Adobe Bridge, so that you can easily see that each file you put in the RemoveFirstPage folder has a first page you want to remove. You don’t want to accidentally put a normal PDF in this folder, because then you will remove from it a first page of actual content!

Once you have these PDFs in the same folder, it’s time to write the automation script. First I’ll explain how to do it in Adobe Acrobat Pro, and below I’ll explain how to do it with a free tool called pdftk.

With Adobe Acrobat Pro

Open Acrobat Pro and click Advanced -> Document Processing -> Batch Processing. Click New Sequence and type the name ‘Delete First Page’.

Then click Select Commands. On the left, scroll down to Execute Javascript and click Add. Click the little + symbol to expand that command, and click where it says “Script”. Then click Edit and replace what is there with the following code:

/* Delete First Page */
this.deletePages(0);

The first line is just a comment that names the script. The second line deletes the first page (Acrobat numbers pages from 0, so the first page is Page 0, not Page 1).

The little grey box next to ‘Execute Javascript’ should be empty; if it’s not, click it to toggle it. This will make it so that when you execute the sequence on hundreds of files, you won’t have to click OK for each one.

Click OK in the Edit Sequence window. I recommend you set ‘Run commands on’ to ‘Ask When Sequence is Run’, and do the same with ‘Select output location’.

Click OK and you’ll be brought back to the list of Batch Sequences. Select your new ‘Delete First Page’ sequence and click Run Sequence. It will show you again the command that will run, so click OK.

You’ll then be prompted to select the files you want to run the process on. Navigate to the special folder where you put all your PDFs that need their first pages removed and select all of them, then click Select. If you chose lots of PDFs, you’ll have to wait a while at this point.

Eventually, it will ask you for the output location. Choose your destination and click OK. The process will begin immediately. My computer can do this to 100 PDFs in about 5 minutes.

With free software: pdftk

You can also do this with a free program called pdftk, available for Windows, Mac, and Linux. Download it here.

Extract pdftk.exe to your C: drive. Then click Start -> Run and type cmd to get a command line window. From there, type cd.. and hit enter. Do this a few times until your prompt says just C: and then type the following command:

pdftk in.pdf cat 2-end output out.pdf

Replace in.pdf with the location and name of your source PDF, and replace out.pdf with the location and name of your destination PDF.
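
For a whole folder rather than a single file, the same command can be wrapped in a short loop. What follows is only a rough Python sketch, assuming Python 3 is installed and pdftk is on your PATH; the names pdftk_command and trim_folder are made up for this illustration and are not part of pdftk.

```python
import subprocess
from pathlib import Path


def pdftk_command(src, dst):
    """Build the pdftk call from the post: keep page 2 through the last page."""
    return ["pdftk", str(src), "cat", "2-end", "output", str(dst)]


def trim_folder(in_dir, out_dir, run=subprocess.run):
    """Write a first-page-trimmed copy of every PDF in in_dir to out_dir.

    `run` is injectable (hypothetical convenience) so the loop can be
    dry-run without pdftk installed. Returns the requested output paths.
    Use a different folder for out_dir than for in_dir.
    """
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(in_dir.glob("*.pdf")):
        dst = out_dir / src.name
        # Passing a list (not one shell string) keeps spaces in filenames intact.
        run(pdftk_command(src, dst), check=True)
        written.append(dst)
    return written
```

Calling trim_folder("RemoveFirstPage", "trimmed") would then leave the originals untouched and put the trimmed copies in a separate folder, which makes it easy to spot-check a few results before deleting anything.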

If you’re in Linux, you can use the following commands to take every PDF in the current directory and copy them to the ‘trimmed’ directory with the first page removed:

mkdir trimmed
for i in *pdf ; do pdftk "$i" cat 2-end output "trimmed/$i" ; done
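
One case worth guarding against is a PDF that has only one page, since "2-end" should then have nothing left to keep. pdftk's dump_data operation prints a NumberOfPages line that can be checked first. The Python sketch below assumes Python 3 and pdftk on your PATH; page_count and has_page_to_spare are hypothetical helper names invented for this example.

```python
import subprocess


def page_count(dump_data_text):
    """Extract the page count from `pdftk <file> dump_data` output.

    Returns None if no NumberOfPages line is present.
    """
    for line in dump_data_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "NumberOfPages":
            return int(value.strip())
    return None


def has_page_to_spare(pdf_path, run=subprocess.run):
    """True if the PDF has at least two pages, so trimming leaves content.

    `run` is injectable (an assumption of this sketch) to allow testing
    without pdftk installed.
    """
    result = run(["pdftk", str(pdf_path), "dump_data"],
                 capture_output=True, text=True, check=True)
    count = page_count(result.stdout)
    return count is not None and count >= 2
```

Skipping any file for which has_page_to_spare returns False avoids asking pdftk for a page range that does not exist.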

Good luck.

{ 18 comments… read them below or add one }

blokhead April 6, 2010 at 8:46 pm

Here is a free way to do it using the very handy pdftk tool, which is available on many platforms. To trim the first page of a single PDF file:

pdftk in.pdf cat 2-end output out.pdf

In Linux, from a bash command line, you can easily automate it like so:

mkdir trimmed
for i in *pdf ; do pdftk "$i" cat 2-end output "trimmed/$i" ; done

This takes every pdf in the current directory and creates a copy in the “trimmed” directory with the first page removed.

pdftk also does about a million other things with PDF files: truncating, rotating, concatenating, etc.

lukeprog April 6, 2010 at 9:04 pm

blokhead,

Thanks, I’ve updated the post. Any way to make pdftk for Windows do the Linux thing: process an entire folder rather than a single file?

lukeprog April 6, 2010 at 9:25 pm

With a Windows batch file, I suppose…

Chris Hallquist April 6, 2010 at 9:41 pm

Speaking of PDFs, how do you get PDFs of journal articles in the first place? (I’ve wondered about how I’ll do this if I land outside of academia at some point.)

Also: does anyone have the PDF of Craig’s debate with Ehrman? It seems to have gone missing from the website it used to be on.

TaiChi April 6, 2010 at 10:02 pm

http://www.holycross.edu/assets/pdfs/resurrection_debate.pdf

Is this it, Chris? I sometimes use http://www.pdf-search-engine.com/ to find stuff like that.

blokhead April 6, 2010 at 10:03 pm

“With a Windows batch file, I suppose…”

Yes, a batch file would work. But it’s been many moons since I’ve written a Windows batch file, and I don’t know the appropriate syntax off the top of my head.

Another possible solution is to set up a “print to PDF” virtual print queue, then use any PDF viewer to print pages 2-end of the original PDF to this print queue. But this solution may not be amenable to automation, and I don’t know the ins and outs of setting up a “print to PDF” queue.

blokhead April 6, 2010 at 10:10 pm

“Speaking of PDFs, how do you get PDFs of journal articles in the first place? (I’ve wondered about how I’ll do this if I land outside of academia at some point.)”

Google scholar is what I often use. Fortunately in my discipline, many authors post preliminary or full versions of their articles on their websites, and Google shows these under the “see all versions” link. I imagine this might be less common in the humanities…

lukeprog April 6, 2010 at 10:26 pm
Rich Griese April 7, 2010 at 1:21 am

This may be a stupid question… but every time I get directed to some “journal”, I find they require payment. I am not associated with any university, but am just a private citizen. I am interested in reading as much journal stuff as I can. If anyone knows of some secret list of free journals that are not supernaturalism oriented, but are history oriented on early Christian history, I would love it if they would consider emailing me such a list.

Cheers! RichGriese@gmail.com

Hermes April 7, 2010 at 4:18 am

Linux & pdftk

Pdftk is Java based and available through most package managers. There is no need to go to the web site.

While you are in there, why not look for other PDF tools? There are dozens of them (graphical, scriptable, or both), allowing you to set up any workflow you want.

lukeprog April 7, 2010 at 6:16 am

Rich,

See my post ‘How to Get Academic Papers for Free’. I am not associated with any university either, but I have more academic papers on my hard drive than anyone I know.

cl April 7, 2010 at 9:45 am

Hey this is off-topic but I didn’t want to spend too much time searching for the posts where you ask for recommendations for your new site design.

One thing I think might prove helpful for people who search the site would be if the link to older posts was up top. That way, a person doesn’t have to get to the bottom of each page to scroll to the previous page.

Chris Hallquist April 7, 2010 at 11:44 am

@TaiChi: That’s it. I thought they had taken the PDF down, but it looks like they just changed the link. Luke, you need to change that on your 500+ debates page.

Also, I had no idea there’s a PDF search engine. That’s awesome.

@Luke: D’oh. Forgot you already wrote about that.

Jake de Backer April 8, 2010 at 7:44 pm

If anyone is looking to exchange academic articles, book reviews, debates, books, etc. please email me @ iamjakeurnot@aol.com. I have several thousand and am always looking for more!

J.

lukeprog April 8, 2010 at 8:18 pm

I can vouch for Jake’s collection and his generosity. :)

ColonelFazackerley April 23, 2010 at 3:54 am

re: batch file
I have not tested this, but…
Try putting this in a batch file. Put the batch file in your “in” directory, and have an “out” directory that shares a common parent directory with “in”. Back up your data before trying it for the first time, because the script deletes each original after processing it.
The quotes around the paths and the "delims=" option are there so that filenames with spaces don’t break it. I put an echo in front of the two indented lines to check, so I think this should work. Let me know how you get on.

@echo off
rem For each PDF here, write a copy without page 1 to ..\out, then delete the original.
for /F "delims=" %%D in ('dir /B *.pdf') do (
    pdftk "%%D" cat 2-end output "..\out\%%D"
    del "%%D"
)

zerodegrees December 30, 2010 at 12:08 am

The bash script:

for i in *pdf ; do pdftk “$i” cat 2-end output “trimmed/$i” ; done

doesn’t work as it is (at least not for my Ubuntu Linux flavor). But this does

for i in *pdf ; do pdftk "$i" cat 2-end output ./trimmed/"$i" ; done

which is by the way syntactically as it should be.

Cheers.

Luke Muehlhauser December 30, 2010 at 7:11 pm

Thanks, zerodegrees!
