Extracting textual information from RTF documents

To index the office documents of a large company and let a web crawler gather the necessary information, we need a tool that extracts text from non-plain-text sources such as PDF, DOCX, ODT, and RTF. A further requirement is to do this in pure PHP, without third-party tools such as antiword or xpdf, and without OLE under Windows. One reason is that OLE, for example, is incredibly slow, even where it can solve the task. Another is to have an independent solution that relies on no existing tools and does not depend on the platform used. The task here is to study the Rich Text Format (RTF), whose specification has grown over its evolution to more than 300 pages in the current version 1.9.1, which certainly does not make the format any easier to parse.