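I. To extract a PDF file's title, date, and content with Python and store them in a MySQL database, follow these steps:
1. Install the required libraries: pdfminer.six (the maintained package that provides pdfminer.high_level), PyPDF2, and mysql-connector-python.
pip install pdfminer.six PyPDF2 mysql-connector-python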
2. Import the required libraries and connect to the MySQL database.
import mysql.connector
from mysql.connector import Error
import PyPDF2
from pdfminer.high_level import extract_text
try:
    connection = mysql.connector.connect(host='localhost',
                                         database='database_name',
                                         user='username',
                                         password='password')
    if connection.is_connected():
        cursor = connection.cursor()
        print("Connected to MySQL database")
except Error as e:
    print("Error while connecting to MySQL", e)
    raise  # stop here: the later steps need a live connection and cursor
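3. Open the PDF file and extract its title, date, and content.
pdf_path = ''  # set this to the path of your PDF file
with open(pdf_path, 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)  # PdfFileReader was removed in PyPDF2 3.0
    info = pdf_reader.metadata  # None if the PDF has no document information
    title = info.title if info else None
    # Raw PDF date string in the form D:YYYYMMDDHHmmSS; convert it before inserting into a DATE column
    date = info.get('/CreationDate') if info else None
content = extract_text(pdf_path)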
4. Insert the extracted information into the MySQL database.
try:
    cursor.execute("INSERT INTO table_name (title, date, content) VALUES (%s, %s, %s)", (title, date, content))
    connection.commit()
    print("Record inserted successfully into MySQL database")
except mysql.connector.Error as error:
    print("Failed to insert record into MySQL database {}".format(error))
finally:
    if connection.is_connected():
        cursor.close()
        connection.close()
        print("MySQL connection is closed")
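Note that you need to replace database_name, username, password, and table_name with your own database details. Also make sure the PDF file is in the same directory as the Python script, or specify the file's full path.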
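The '/CreationDate' value read in step 3 is a raw PDF date string, typically of the form D:YYYYMMDDHHmmSS followed by an optional time zone offset, so it usually needs converting before it goes into a date column. The following is a minimal sketch of such a conversion. The helper name parse_pdf_date is hypothetical, and the sketch assumes the year, month, and day are all present; it returns None for shorter forms such as D:YYYY rather than guessing defaults.

```python
from datetime import date
from typing import Optional


def parse_pdf_date(raw: Optional[str]) -> Optional[date]:
    """Convert a PDF date string such as "D:20220320101530+08'00'" to a date.

    Hypothetical helper: returns None when the value is missing or does not
    begin with a full, valid YYYYMMDD part.
    """
    if not raw:
        return None
    text = raw.strip()
    # The "D:" prefix is common but not guaranteed to be present.
    if text.startswith("D:"):
        text = text[2:]
    digits = text[:8]
    if len(digits) != 8 or not digits.isdigit():
        return None
    try:
        # date() rejects impossible values such as month 13 or day 32.
        return date(int(digits[:4]), int(digits[4:6]), int(digits[6:8]))
    except ValueError:
        return None
```

With this helper, the insert in step 4 could pass parse_pdf_date(date) instead of the raw string; a None result is stored as NULL if the column allows it.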
II. Worked example
1. Assumed text content
Title: Sample PDF Document
Date: 2022-03-20
Content: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed at est at lectus viverra malesuada. Pellentesque fermentum dolor vel finibus consequat. Nulla facilisi.
2. Create a table to store the PDF data
CREATE TABLE pdf_data (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255),
    date DATE,
    content TEXT
);
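The column types in this table impose size limits: VARCHAR(255) holds at most 255 characters, and a MySQL TEXT column holds at most 65,535 bytes, which the extracted text of a long PDF can exceed. One option is to declare content as MEDIUMTEXT or LONGTEXT instead. Another is to trim values before inserting, as in the sketch below. The helper names truncate_utf8 and fit_to_columns are hypothetical, and the byte count assumes the column stores UTF-8 (for example utf8mb4).

```python
from typing import Optional, Tuple

TITLE_MAX_CHARS = 255    # VARCHAR(255) is limited by character count
TEXT_MAX_BYTES = 65_535  # MySQL TEXT is limited by byte count


def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text so its UTF-8 encoding fits in max_bytes without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops a trailing partial multi-byte sequence, if any.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


def fit_to_columns(title: Optional[str], content: Optional[str]) -> Tuple[Optional[str], str]:
    """Trim title and content to the limits of the pdf_data columns.

    A missing title stays None so it is stored as NULL.
    """
    safe_title = title[:TITLE_MAX_CHARS] if title else None
    safe_content = truncate_utf8(content or "", TEXT_MAX_BYTES)
    return safe_title, safe_content
```

Truncation silently loses data, so widening the column is usually the better choice when the full text matters.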
3. Write Python code that parses the PDF and stores it in the database
import PyPDF2
from datetime import datetime
import mysql.connector
# Open the PDF file (set the path to your PDF file)
pdf_file = open('', 'rb')
# Read the PDF metadata (PdfFileReader and getDocumentInfo() were removed in PyPDF2 3.0)
pdf_reader = PyPDF2.PdfReader(pdf_file)
pdf_info = pdf_reader.metadata
title = pdf_info.title
# Read the PDF content page by page
content = ''
for page in pdf_reader.pages:
    content += page.extract_text()
# Format the date: the '/CreationDate' value looks like D:YYYYMMDDHHmmSS,
# so characters 2 to 9 hold YYYYMMDD
date_str = pdf_info.get('/CreationDate')[2:10]
date = datetime.strptime(date_str, '%Y%m%d').date()
# Store the data in the MySQL database
cnx = mysql.connector.connect(user='username', password='password', host='localhost', database='pdf_db')
cursor = cnx.cursor()
add_pdf = ("INSERT INTO pdf_data (title, date, content) VALUES (%s, %s, %s)")
pdf_data = (title, date, content)
cursor.execute(add_pdf, pdf_data)
cnx.commit()
# Close the database connection and PDF file
cursor.close()
cnx.close()
pdf_file.close()
4. After the insert succeeds, query the database
SELECT * FROM pdf_data;
The result looks roughly like this:
| id | title | date | content |
|---|---|---|---|
| 1 | Sample PDF Document | 2022-03-20 | Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed at est at lectus viverra malesuada. Pellentesque fermentum dolor vel finibus consequat. Nulla facilisi. |
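To try the insert-then-query flow without a running MySQL server, the round trip can be sketched with Python's built-in sqlite3 module as a stand-in. This is not MySQL: SQLite uses ? placeholders where mysql-connector-python uses %s, it has no native DATE type (the date is stored as ISO text here), and the auto-increment syntax differs. The function name round_trip and the shortened sample content are illustrative assumptions.

```python
import sqlite3
from datetime import date


def round_trip(record):
    """Insert one (title, date, content) record into an in-memory table and read it back."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE pdf_data ("
            " id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " title TEXT, date TEXT, content TEXT)"
        )
        title, day, content = record
        # sqlite3 uses "?" placeholders; mysql-connector-python uses "%s".
        conn.execute(
            "INSERT INTO pdf_data (title, date, content) VALUES (?, ?, ?)",
            (title, day.isoformat(), content),
        )
        conn.commit()
        return conn.execute(
            "SELECT id, title, date, content FROM pdf_data"
        ).fetchall()
    finally:
        conn.close()


rows = round_trip(
    ("Sample PDF Document", date(2022, 3, 20), "Lorem ipsum dolor sit amet")
)
print(rows)
# [(1, 'Sample PDF Document', '2022-03-20', 'Lorem ipsum dolor sit amet')]
```

Both drivers take the values as a separate parameter tuple rather than formatting them into the SQL string, which is what keeps text extracted from a PDF from being interpreted as SQL.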