Apache Tika 1.14 Released: A Content Extraction Toolkit

2016-11-13 14:41 | Posted by: joejoe0332 | Original author: oschina | Source: oschina

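Apache Tika 1.14 has been released, with a number of improvements and bug fixes. Tika is a toolkit for text extraction. It integrates POI and PDFBox and provides a unified interface for text extraction. Tika also offers a convenient extension API for adding support for third-party file formats.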

Changes in this release:

  • Extract all headers from MSG/RFC822 (TIKA-2122).

  • Upgrade metadata-extractor to 2.9.1 (TIKA-2113).

  • Extract PDF DocInfo metadata into separate keys to prevent overwriting by XMP metadata (TIKA-2057).

  • Re-enable fileUrl for tika-server (TIKA-2081). If you choose to use this feature, beware of the security vulnerabilities! See: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2015-3271

  • Add Tesseract's hOCR output format as an option, via Eric Pugh (TIKA-2093).

  • Extract macros from MSOffice files (TIKA-2069).

  • Maintain passed-in mime in TXTParser (TIKA-2047).

  • Upgrade to POI 3.15 (TIKA-2013).

  • Upgrade to PDFBox 2.0.3 (TIKA-2051).

  • Fix hyperlinks with formatting in DOC and DOCX (TIKA-1255 and TIKA-2078).

  • Tika is now integrated with the TensorFlow library from Google and can use its Inception v3 image classification model to identify objects in images (TIKA-1993).

  • Parser configuration is now type-safe and parameters for parsers can have assigned types (TIKA-1508, TIKA-1986).

  • Prevent OOM/permanent hang on some corrupt CHM files (TIKA-2040).

  • Upgrade ICU4J charset detection components to fix multithreading bug (TIKA-2041).

  • Upgrade to Jackcess 2.1.4 (TIKA-2039).

  • Maintain more significant digits in cells of "General" format in XLS and XLSX (TIKA-2025).

  • Avoid mark/reset issues when extracting or detecting embedded resources in RFC822 emails (TIKA-2037).

  • Improve accuracy of Tesseract for better extraction of numeric and alphanumeric text from images (TIKA-2021, TIKA-2031).

  • Improve extraction of embedded documents from PPT, PPTX and XLSX (TIKA-2026).

  • Add parser for applefile (AppleSingle) (TIKA-2022).

  • Add mime types, mime magic and/or globs for:

    • Endnote Import File (TIKA-2011)

    • DJVU files (TIKA-2009)

    • MS Owner File (TIKA-2008)

    • Windows Media Metafile (TIKA-2004)

    • iCal and vCalendar (TIKA-2006)

    • MBOX (TIKA-2042)

    • Stata DTA (TIKA-2064)

  • Add configurable maximum threshold for number of events extracted from the XMP Media Management Schema in JempboxExtractor (TIKA-1999).

  • Integrate TesseractOCR with full page image rendering for PDFs (TIKA-1994).

  • Add mime detection (via Nick C) and a parser for DBF files (TIKA-1513).

  • Add mime detection and parsers for MSOffice 2003 XML Word and Excel formats (TIKA-1958).

  • Extract hyperlinks from PPT, PPTX, XLSX (TIKA-1454).

  • Upgrade to Commons Compress 1.12 (supports progress on TIKA-1358)
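
Several entries in the list above add "mime magic" and "globs" for new file types. As a rough illustration of what those two detection mechanisms mean, here is a minimal Python sketch. It is not Tika's code or API: the `detect` function, both rule tables, and the exact mime type strings are assumptions chosen for illustration, and Tika keeps its real definitions in its own mime type registry.

```python
from fnmatch import fnmatchcase

# Illustrative rules only (assumptions, not copied from Tika).
# A "magic" rule matches a byte pattern at a fixed offset in the file content.
MAGIC_RULES = [
    (b"AT&TFORM", 0, "image/vnd.djvu"),        # DjVu files start with this marker
    (b"BEGIN:VCALENDAR", 0, "text/calendar"),  # iCal / vCalendar
    (b"From ", 0, "application/mbox"),         # MBOX message separator line
]

# A "glob" rule matches the file name when the content gives no answer.
GLOB_RULES = [
    ("*.djvu", "image/vnd.djvu"),
    ("*.ics", "text/calendar"),
    ("*.mbox", "application/mbox"),
    ("*.dta", "application/x-stata-dta"),
]

DEFAULT_TYPE = "application/octet-stream"


def detect(data: bytes, name: str = "") -> str:
    """Return a mime type guess: magic bytes first, then file name globs."""
    for magic, offset, mime in MAGIC_RULES:
        if data[offset:offset + len(magic)] == magic:
            return mime
    lowered = name.lower()
    for pattern, mime in GLOB_RULES:
        if fnmatchcase(lowered, pattern):
            return mime
    return DEFAULT_TYPE
```

In this sketch, content-based rules take priority over the file name, so a DjVu file renamed to `scan.bin` would still be reported as `image/vnd.djvu`, while an empty file named `notes.ICS` falls back to the glob rule.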

See the release notes for the full list of changes.
