Tuesday, January 25, 2011

Extracting images from PDF files

If you would rather use existing software, the following tools are available:

http://opensecrets.pixnet.net/blog/post/27841494

http://azo-freeware.blogspot.com/2008/08/some-pdf-image-extract-14.html

If you are using Linux/Perl, see the following link:


If you write your own program and the images are JPEG files, the job is quite simple, as this Python script shows:

http://nedbatchelder.com/blog/200712/extracting_jpgs_from_pdfs.html
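To show why the JPEG case is easy: a JPEG stored with the DCTDecode filter sits inside the PDF stream byte for byte, so it can be cut out by scanning for markers. The following is a minimal sketch in the same spirit as the script linked above, not a copy of it. The function name `extract_jpegs` is my own, and it assumes the file is unencrypted, that every JPEG stream starts right after the `stream` keyword, and that the byte sequence `endstream` never occurs by chance inside the image data.

```python
import re

# "stream" ... "endstream" delimit the body of a PDF stream object.
# The lookbehind keeps the "stream" inside "endstream" from matching.
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)

JPEG_SOI = b"\xff\xd8"  # JPEG start-of-image marker
JPEG_EOI = b"\xff\xd9"  # JPEG end-of-image marker


def extract_jpegs(pdf_bytes):
    """Return a list of byte strings, one per stream that looks like a JPEG."""
    images = []
    for match in STREAM_RE.finditer(pdf_bytes):
        body = match.group(1)
        if not body.startswith(JPEG_SOI):
            continue  # not a raw JPEG (e.g. a FlateDecode stream)
        end = body.rfind(JPEG_EOI)
        if end == -1:
            continue  # no end marker found, skip rather than guess
        # Drop the end-of-line bytes that precede "endstream".
        images.append(body[: end + len(JPEG_EOI)])
    return images
```

To use it, read the PDF with `open(path, "rb").read()`, pass the bytes to `extract_jpegs`, and write each returned item to its own `.jpg` file.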

Otherwise, you will need to consult some references:

http://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python

http://www.jpedal.org/PDFblog/2010/04/understanding-the-pdf-file-format-how-are-images-stored/

These also confirm that when the images are not plain JPEG files, "extracting images from a PDF file" can become quite troublesome.
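To make the "troublesome" part concrete: an image stored with the FlateDecode filter is just zlib-compressed raw samples with no image file header, so the program has to rebuild a file from the values in the image dictionary (/Width, /Height, /ColorSpace, /BitsPerComponent). The sketch below covers only the simplest case and is an illustration, not a general solution. The function name `flate_image_to_pnm` is hypothetical, and it assumes 8 bits per component, no predictor, and a DeviceGray (1 component) or DeviceRGB (3 components) color space.

```python
import zlib


def flate_image_to_pnm(stream_data, width, height, components):
    """Turn one FlateDecode image stream into PGM/PPM bytes.

    stream_data: the raw bytes between "stream" and "endstream"
    width, height: /Width and /Height from the image dictionary
    components: 1 for DeviceGray, 3 for DeviceRGB (assumed, 8 bits each)
    """
    samples = zlib.decompress(stream_data)

    expected = width * height * components
    if len(samples) != expected:
        # A mismatch usually means a predictor, another bit depth,
        # or another color space, none of which this sketch handles.
        raise ValueError(
            "got %d sample bytes, expected %d" % (len(samples), expected)
        )

    if components == 1:
        magic = b"P5"  # binary grayscale (PGM)
    elif components == 3:
        magic = b"P6"  # binary RGB (PPM)
    else:
        raise ValueError("unsupported number of components")

    header = magic + b"\n%d %d\n255\n" % (width, height)
    return header + samples
```

Indexed color, CMYK, 1-bit masks, predictors, and other filters such as CCITTFaxDecode or JPXDecode each need their own handling, which is where most of the work goes.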

If Chinese text is involved, see the following link:

http://ccckmit.wikidot.com/pdf:streamcoding

The original PDF specification:

http://partners.adobe.com/public/developer/en/pdf/PDFReference16.pdf

A concise explanation of the PDF file format:

http://www.mactech.com/articles/mactech/Vol.15/15.09/PDFIntro/

The key points are excerpted below:


b  closepath, fill, and stroke path.
B  fill and stroke path.
b*  closepath, eofill, and stroke path.
B*  eofill and stroke path.
BI  begin image.
BMC  begin marked content.
BT  begin text object.
BX  begin section allowing undefined operators.
c  curveto.
cm  concat. Concatenates the matrix to the current transform.
cs  setcolorspace for fill.
CS  setcolorspace for stroke.
d  setdash.
Do  execute the named XObject.
DP  mark a place in the content stream, with a dictionary.
EI  end image.
EMC  end marked content.
ET  end text object.
EX  end section that allows undefined operators.
f  fill path.
f*  eofill (even-odd fill path).
g  setgray (fill).
G  setgray (stroke).
gs  set parameters in the extended graphics state.
h  closepath.
i  setflat.
ID  begin image data.
j  setlinejoin.
J  setlinecap.
k  setcmykcolor (fill).
K  setcmykcolor (stroke).
l  lineto.
m  moveto.
M  setmiterlimit.
n  end path without fill or stroke.
q  save graphics state.
Q  restore graphics state.
re  rectangle.
rg  setrgbcolor (fill).
RG  setrgbcolor (stroke).
s  closepath and stroke path.
S  stroke path.
sc  setcolor (fill).
SC  setcolor (stroke).
sh  shfill (shaded fill).
Tc  set character spacing.
Td  move text current point.
TD  move text current point and set leading.
Tf  set font name and size.
Tj  show text.
TJ  show text, allowing individual character positioning.
TL  set leading.
Tm  set text matrix.
Tr  set text rendering mode.
Ts  set super/subscripting text rise.
Tw  set word spacing.
Tz  set horizontal scaling.
T*  move to start of next line.
v  curveto.
w  setlinewidth.
W  clip.
y  curveto.

TABLE 1: PDF Page Markup Operators
(Note: Equivalent PostScript operators are in boldface.)
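
For image extraction, the operator in this table that matters most is `Do`: a page paints an image by naming an XObject and then executing it, for example `/Im1 Do`. As a rough sketch, the names can be pulled out of an already decoded content stream with a regular expression. The function name `find_xobject_names` is my own, the input is assumed to be decompressed, and a string literal that happens to contain text like `/X Do` would produce a false match, since this is not a real tokenizer.

```python
import re

# A PDF name (slash followed by non-delimiter characters), whitespace,
# then the Do operator as a separate token.
XOBJECT_DO_RE = re.compile(rb"/([^\s/\[\]<>(){}%]+)\s+Do\b")


def find_xobject_names(content_stream):
    """Return the XObject names that a decoded content stream executes."""
    return [
        match.group(1).decode("latin-1")
        for match in XOBJECT_DO_RE.finditer(content_stream)
    ]
```

Each name found this way still has to be looked up in the page's /Resources /XObject dictionary, and only entries whose /Subtype is /Image are images. Inline images (`BI` ... `ID` ... `EI`) never go through `Do` and need separate handling.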
