How to Extract Table Data from PDF Files with Python (Code Examples) (2) - 木庄网络博客

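To extract the only table on the page, we first need to locate where it sits. The PDF coordinate system differs from that of an image: its origin is the bottom-left corner of the page, with the x-axis pointing right and the y-axis pointing up. The following Python code plots the coordinates of all the text on the page: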

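import camelot

# Extract tables from page 53 of the PDF using the stream flavor
tables = camelot.read_pdf('G://Statistics-Fundamentals-Succinctly.pdf', pages='53',
                          flavor='stream')

# Plot the text coordinates of the page to locate the table.
# In more recent Camelot releases this call is written as
# camelot.plot(tables[0], kind='text').
tables[0].plot('text')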

The output is:

UserWarning: No tables found on page-53 [stream.py:292]

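The code reports that no table was found: by default the stream method treats the entire PDF page as one table, so it fails to isolate the table we want. It does still plot the text coordinates of the page, and that plot is what lets us locate the table.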
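Because PDF coordinates are measured from the bottom-left corner, a region read off an ordinary image (where y grows downward from the top-left) has to be flipped vertically before it can be used. The sketch below illustrates that conversion; the function name `image_box_to_pdf_box` and the page height of 792 points are assumptions made for illustration only, since the article does not state the page size of this PDF.

```python
def image_box_to_pdf_box(left, top, right, bottom, page_height):
    """Convert a box measured from the top-left corner (image convention)
    into top-left and bottom-right corners measured from the bottom-left
    corner (PDF convention). Only the y values change."""
    return (left, page_height - top, right, page_height - bottom)


# Hypothetical page height in points; not taken from the article.
PAGE_HEIGHT = 792

# A box whose top edge is 172 units and bottom edge 252 units below the page top.
print(image_box_to_pdf_box(50, 172, 500, 252, PAGE_HEIGHT))
# prints (50, 620, 500, 540)
```

Under that assumed page height, the result has a larger y value for the top edge than for the bottom edge, which is the ordering expected in PDF space.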

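Comparing the plot carefully with the original PDF page, it is easy to see that the table occupies a region whose top-left corner is at (50, 620) and whose bottom-right corner is at (500, 540). We pass this region to read_pdf() through the table_area parameter. The complete Python code is: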

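import camelot

# Extract the table from the specified region only.
# The region is 'x1,y1,x2,y2': top-left corner, then bottom-right corner.
# In more recent Camelot releases this parameter is named table_areas.
tables = camelot.read_pdf('G://Statistics-Fundamentals-Succinctly.pdf', pages='53',
                          flavor='stream', table_area=['50,620,500,540'])

# Get the extracted table as a pandas DataFrame
table_df = tables[0].df

print(type(table_df))
print(table_df.head(n=6))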

The output is:

<class 'pandas.core.frame.DataFrame'>
         0               1                2           3
0  Student  Pre-test score  Post-test score  Difference
1        1              70               73           3
2        2              64               65           1
3        3              69               63          -6
4        …               …                …           …
5       34              82               88           6
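In the output above, the header text sits in row 0 and the columns are simply numbered 0 to 3. One possible follow-up step is to promote that first row to column names and convert the remaining cells to numbers. The sketch below is not part of the original article: `promote_header` is a hypothetical helper, the sample frame is copied from the printed output as a stand-in for `tables[0].df`, and it assumes every cell comes back as a string.

```python
import pandas as pd

# Stand-in for tables[0].df, copied from the printed output above.
# Assumption: every cell is a string.
raw = pd.DataFrame([
    ["Student", "Pre-test score", "Post-test score", "Difference"],
    ["1", "70", "73", "3"],
    ["2", "64", "65", "1"],
    ["3", "69", "63", "-6"],
    ["…", "…", "…", "…"],
    ["34", "82", "88", "6"],
])


def promote_header(df):
    """Use the first row as column names and convert the rest to numbers."""
    body = df.iloc[1:].reset_index(drop=True)
    body.columns = df.iloc[0].tolist()
    # Cells that are not numeric, such as the "…" placeholder, become NaN.
    return body.apply(pd.to_numeric, errors="coerce")


clean = promote_header(raw)

# Drop the placeholder row and show the remaining values as integers.
print(clean.dropna().astype(int))
```

Using `errors="coerce"` keeps the conversion from failing on the "…" row; whether to drop such rows or keep them as missing values depends on what the data is used for next.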