Extracting specific pages from a PDF file

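A while ago I bought a Kindle e-reader, and I wanted to use it to read PDF documents, mainly the Python standard library reference and the official MySQL documentation.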

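That is where the trouble started. Both documents are huge. Reading them on a Mac was fine, but on the Kindle it is really inconvenient, mainly because the Kindle's PDF support is poor and it cannot navigate by table of contents. So I decided to split the large PDF files into smaller ones, chapter by chapter.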

1. Install the PyPDF2 Python package:

pip3 install PyPDF2

2. Extract pages from the source PDF file:

#!/usr/local/python/bin/python3
"""
Extract pages from a PDF
"""
from PyPDF2 import PdfFileReader, PdfFileWriter

if __name__ == "__main__":
    reader = PdfFileReader('/Users/jianglexing/Documents/linux/python/python-3.6/library.pdf')
    writer = PdfFileWriter()
    # index of the first page to extract (0-based)
    start = 108
    # stop index (exclusive): the last page extracted is stop - 1
    stop = 126
    with open('/Users/jianglexing/Documents/python-std-re.pdf', 'wb') as wstream:
        for page in range(start, stop):
            temp = reader.getPage(page)
            writer.addPage(temp)
        writer.write(wstream)
    print("Extraction finished")
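Note that `getPage` takes a 0-based index and `range(start, stop)` excludes `stop`, so with `start=108` and `stop=126` the script copies what a PDF viewer would show as pages 109 through 126 (assuming the viewer counts from 1 with no offset). A minimal sketch of a helper that converts viewer-style page numbers into the indices the loop expects; the name `page_indices` is hypothetical and not part of PyPDF2:

```python
def page_indices(first_page, last_page):
    """Convert 1-based, inclusive page numbers into 0-based indices.

    Hypothetical helper for illustration, not part of PyPDF2.
    """
    if first_page < 1:
        raise ValueError("first_page must be at least 1")
    if last_page < first_page:
        raise ValueError("last_page must not be smaller than first_page")
    # Page N in a viewer is index N - 1; range() excludes its upper bound,
    # so last_page itself is the right stop value.
    return range(first_page - 1, last_page)


if __name__ == "__main__":
    indices = page_indices(109, 126)
    print(indices.start, indices.stop)
```

The loop in the script could then iterate over `page_indices(109, 126)` instead of `range(start, stop)`.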

3. The functionality works, but the script is not very user-friendly yet, so let's improve the code:

#!/usr/local/python/bin/python3
"""
Extract pages from a PDF
"""
import argparse

from PyPDF2 import PdfFileReader, PdfFileWriter

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--source-file', default=r'/Users/jianglexing/Documents/linux/python/python-3.6/library.pdf', help='full path of the source file')
    parser.add_argument('--target-file', default=r'/tmp/target.pdf', help='full path of the target file')
    # the original default values were lost, so both page options are required here
    parser.add_argument('--start-page', required=True, type=int, help='index of the first page (0-based)')
    parser.add_argument('--stop-page', required=True, type=int, help='stop index (exclusive)')
    args = parser.parse_args()
    reader = PdfFileReader(args.source_file)
    writer = PdfFileWriter()
    with open(args.target_file, 'wb') as wstream:
        for page in range(args.start_page, args.stop_page):
            temp = reader.getPage(page)
            writer.addPage(temp)
        writer.write(wstream)
    print("Extraction finished")
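The script passes the page options straight into `range`, so a negative start or a stop that is not greater than the start silently produces an odd or empty output file. A minimal sketch of how the two page options could be validated with `parser.error`; the function name `parse_page_args` is hypothetical, and only the page options are shown to keep it short:

```python
import argparse


def parse_page_args(argv=None):
    """Parse and validate the page options (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Extract a page range from a PDF")
    parser.add_argument('--start-page', required=True, type=int,
                        help='index of the first page (0-based)')
    parser.add_argument('--stop-page', required=True, type=int,
                        help='stop index (exclusive)')
    args = parser.parse_args(argv)
    # parser.error prints a usage message and exits with status 2
    if args.start_page < 0:
        parser.error('--start-page must not be negative')
    if args.stop_page <= args.start_page:
        parser.error('--stop-page must be greater than --start-page')
    return args


if __name__ == "__main__":
    args = parse_page_args(['--start-page', '108', '--stop-page', '126'])
    print(args.start_page, args.stop_page)
```

Passing `argv=None` makes `argparse` read `sys.argv`, so the same function works both from the command line and when called with an explicit list.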

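4. Some problems are still unsolved. If the source file is too large, the script fails with an error. I have not read the PyPDF2 source code yet, so for now I do not know how to fix it: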

JianglexingdeMacBook-Pro:Desktop jianglexing$ python3 splitpdf.py --source-file='/Users/jianglexing/Desktop/refman-5.7.18-en.a4.pdf' --target-file=/Users/jianglexing/Desktop/temp.pdf --start-page= --stop-page=
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/PyPDF2/generic.py", line , in __new__
    return decimal.Decimal.__new__(cls, utils.str_(value), context)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/PyPDF2/utils.py", line , in str_
    if sys.version_info[] < :
RecursionError: maximum recursion depth exceeded in comparison
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "splitpdf.py", line , in <module>
    writer.write(wstream)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/PyPDF2/pdf.py", line , in write
    self._sweepIndirectReferences(externalReferenceMap, self._root)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/PyPDF2/pdf.py", line , in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/PyPDF2/pdf.py", line , in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/PyPDF2/pdf.py", line , in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/PyPDF2/pdf.py", line , in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/PyPDF2/pdf.py", line , in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, data[i])
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/PyPDF2/pdf.py", line , in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
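
The traceback shows `_sweepIndirectReferences` calling itself until Python's recursion limit is reached, so the failure seems to come from how deeply the document's objects are nested rather than from file size alone. A general workaround for a `RecursionError` is to raise the interpreter's limit with `sys.setrecursionlimit` before the deep call. Whether that is enough for this particular PDF has not been verified here. A minimal sketch, where `recursion_limit` is a hypothetical helper and a nested list stands in for the PDF's object tree:

```python
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit):
    """Temporarily raise the recursion limit (hypothetical helper)."""
    old_limit = sys.getrecursionlimit()
    # never lower the limit below its current value
    sys.setrecursionlimit(max(limit, old_limit))
    try:
        yield
    finally:
        sys.setrecursionlimit(old_limit)


def nested_list(depth):
    """Build a list wrapped `depth` times around an empty list, iteratively."""
    node = []
    for _ in range(depth):
        node = [node]
    return node


def sweep(node):
    """Walk the nested list recursively and return how many levels it has."""
    depth = 0
    for child in node:
        depth = max(depth, sweep(child))
    return depth + 1


if __name__ == "__main__":
    tree = nested_list(3000)
    with recursion_limit(20000):
        print(sweep(tree))
```

In the script, the same idea would mean wrapping the `writer.write(wstream)` call in `with recursion_limit(...)`. A very high limit can crash the interpreter instead of raising an exception, so the value should be raised with care.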
