IFILTER_INIT enumeration

Contains flags that control:

  • Mapping of nonprinting character codes
  • Property output
  • Embedding scope
  • IFilter access patterns

The Init method uses these flags to control the filtering process.

Syntax

typedef enum tagIFILTER_INIT { 
  IFILTER_INIT_CANON_PARAGRAPHS         = 1,
  IFILTER_INIT_HARD_LINE_BREAKS         = 2,
  IFILTER_INIT_CANON_HYPHENS            = 4,
  IFILTER_INIT_CANON_SPACES             = 8,
  IFILTER_INIT_APPLY_INDEX_ATTRIBUTES   = 16,
  IFILTER_INIT_APPLY_CRAWL_ATTRIBUTES   = 256,
  IFILTER_INIT_APPLY_OTHER_ATTRIBUTES   = 32,
  IFILTER_INIT_INDEXING_ONLY            = 64,
  IFILTER_INIT_SEARCH_LINKS             = 128,
  IFILTER_INIT_FILTER_OWNED_VALUE_OK    = 512,
  IFILTER_INIT_FILTER_AGGRESSIVE_BREAK  = 1024,
  IFILTER_INIT_DISABLED_EMBEDDED        = 2048,
  IFILTER_INIT_EMIT_FORMATTING          = 4096
} IFILTER_INIT;

Constants

  • IFILTER_INIT_CANON_PARAGRAPHS
    Paragraph breaks should be marked with the Unicode PARAGRAPH SEPARATOR (0x2029).

  • IFILTER_INIT_HARD_LINE_BREAKS
    Soft returns, such as the newline character in Word, should be replaced by hard returns, that is, LINE SEPARATOR (0x2028). Existing hard returns can be doubled. A carriage return (0x000D), line feed (0x000A), or the carriage return and line feed in combination should be considered a hard return. The intent is to enable pattern-expression matching against observed line breaks.

  • IFILTER_INIT_CANON_HYPHENS
    Various word-processing programs have forms of hyphens that are not represented in the host character set, such as optional hyphens (appearing only at the end of a line) and nonbreaking hyphens. This flag indicates that optional hyphens are to be converted to nulls, and nonbreaking hyphens are to be converted to normal hyphens (0x2010) or HYPHEN-MINUS characters (0x002D).

  • IFILTER_INIT_CANON_SPACES
    All special space characters, such as nonbreaking spaces, are converted to the standard space character (0x0020).

  • IFILTER_INIT_APPLY_INDEX_ATTRIBUTES
    The client requires that text be split into chunks that represent internal value-type properties.

  • IFILTER_INIT_APPLY_CRAWL_ATTRIBUTES
    The client wants text split into chunks representing properties determined during the indexing process.

  • IFILTER_INIT_APPLY_OTHER_ATTRIBUTES
    Any properties not covered by the IFILTER_INIT_APPLY_INDEX_ATTRIBUTES and IFILTER_INIT_APPLY_CRAWL_ATTRIBUTES flags should be emitted.

  • IFILTER_INIT_INDEXING_ONLY
    The client calls the Init method only once, optimizing IFilter for indexing.

  • IFILTER_INIT_SEARCH_LINKS
    The text extraction process must recursively search all linked objects within the document. If a link is unavailable, the GetChunk call that would have obtained the first chunk of the link should return FILTER_E_LINK_UNAVAILABLE.

  • IFILTER_INIT_FILTER_OWNED_VALUE_OK
    The content indexing process can return property values set by the filter.

  • IFILTER_INIT_FILTER_AGGRESSIVE_BREAK
    Text should be broken in chunks more aggressively than normal.

  • IFILTER_INIT_DISABLED_EMBEDDED
    The IFilter should not return chunks from embedded content.

  • IFILTER_INIT_EMIT_FORMATTING
    The IFilter should emit formatting info.

Remarks

Generally, text output by the GetText method should exactly match the text of the document. However, to achieve maximum interoperability, some common features should be standardized. These features include paragraph breaks, line breaks, hyphens, and spaces. IFilter interface servers can also embed null characters in text, which are completely ignored by clients. That is, Unicode character 0x0000 is completely ignored and is not treated as a word break.
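As a small illustration of how a client might disregard embedded nulls, the following sketch removes every 0x0000 from a buffer of filtered text before further processing. StripEmbeddedNulls is a hypothetical helper and is not part of Filter.h; the use of std::u16string to hold the text is also an assumption made for portability.

```cpp
#include <algorithm>
#include <string>

// Hypothetical client-side helper (not part of Filter.h): drops every U+0000
// from a buffer of filtered text so that embedded nulls have no effect on
// later processing such as word breaking.
std::u16string StripEmbeddedNulls(std::u16string text) {
    text.erase(std::remove(text.begin(), text.end(), u'\0'), text.end());
    return text;
}
```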

Four flags control text standardization: IFILTER_INIT_CANON_PARAGRAPHS, IFILTER_INIT_HARD_LINE_BREAKS, IFILTER_INIT_CANON_HYPHENS, and IFILTER_INIT_CANON_SPACES.
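The hyphen and space rules can be sketched as a single pass over the text. This is a minimal sketch rather than a reference implementation: Canonicalize and the k-prefixed constants are hypothetical names, the flag values are copied from the Syntax section, and it assumes the filter has already decoded the document's optional hyphens, nonbreaking hyphens, and nonbreaking spaces to U+00AD, U+2011, and U+00A0. Paragraph and line-break handling is left out because it depends on the source format.

```cpp
#include <cstdint>
#include <string>

// Flag values copied from the Syntax section; real code would use Filter.h.
constexpr std::uint32_t kCanonHyphens = 4;
constexpr std::uint32_t kCanonSpaces  = 8;

// Assumed Unicode representations of the special characters.
constexpr char16_t kSoftHyphen        = u'\u00AD';  // optional hyphen
constexpr char16_t kNonBreakingHyphen = u'\u2011';
constexpr char16_t kHyphen            = u'\u2010';
constexpr char16_t kNoBreakSpace      = u'\u00A0';

// Hypothetical helper: applies the hyphen and space rules to one text buffer.
std::u16string Canonicalize(const std::u16string& in, std::uint32_t flags) {
    std::u16string out;
    out.reserve(in.size());
    for (char16_t ch : in) {
        if (flags & kCanonHyphens) {
            if (ch == kSoftHyphen) {
                ch = u'\0';            // optional hyphen becomes a null
            } else if (ch == kNonBreakingHyphen) {
                ch = kHyphen;          // nonbreaking hyphen becomes HYPHEN
            }
        }
        if ((flags & kCanonSpaces) && ch == kNoBreakSpace) {
            ch = u' ';                 // special space becomes SPACE (0x0020)
        }
        out.push_back(ch);
    }
    return out;
}
```

With neither flag set, the function returns its input unchanged, which matches the default expectation that filtered text mirrors the document.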

Different clients of the IFilter interface require different views of an object. Three flags, IFILTER_INIT_APPLY_INDEX_ATTRIBUTES, IFILTER_INIT_APPLY_CRAWL_ATTRIBUTES, and IFILTER_INIT_APPLY_OTHER_ATTRIBUTES, control the set of properties that should be applied to chunks. In addition, a client can request specific properties by passing them to the Init method in the aAttributes array, whose length is given by cAttributes.
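On the client side, the property-related part of the flags argument is a bitwise OR of the three values. The sketch below shows one way to build it; the sketch namespace, PropertyRequest, and PropertyFlags are hypothetical names introduced for illustration, and the constants are local copies of the values in the Syntax section so that the snippet does not depend on Filter.h.

```cpp
#include <cstdint>

namespace sketch {

// Values copied from the Syntax section; real code would use Filter.h.
constexpr std::uint32_t kApplyIndexAttributes = 16;
constexpr std::uint32_t kApplyOtherAttributes = 32;
constexpr std::uint32_t kApplyCrawlAttributes = 256;

// Hypothetical description of which property groups a client wants.
struct PropertyRequest {
    bool index = false;
    bool crawl = false;
    bool other = false;
};

// Builds the property-related bits of the flags value passed to Init.
constexpr std::uint32_t PropertyFlags(PropertyRequest r) {
    return (r.index ? kApplyIndexAttributes : 0u) |
           (r.crawl ? kApplyCrawlAttributes : 0u) |
           (r.other ? kApplyOtherAttributes : 0u);
}

}  // namespace sketch
```

The result would then be combined, again with bitwise OR, with any text-standardization or access-pattern flags before the call to Init.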

IFilter interface implementations need to store some chunk information when operations other than content indexing occur. IFILTER_INIT_INDEXING_ONLY tells the implementation that the client calls the Init method only once, which lets the filter be optimized for indexing.

For viewing purposes, it can be desirable to search across links as well as in the document and any objects embedded in it. IFILTER_INIT_SEARCH_LINKS specifies that all links are searched recursively.

Certain IFilter interface implementations might generate property values during the content indexing process, and IFILTER_INIT_FILTER_OWNED_VALUE_OK indicates that it is okay to return these values.
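On the implementation side, the access-pattern flags discussed above could be decoded once when Init is called and cached for later calls. The following is only a sketch under stated assumptions: FilterOptions, DecodeOptions, and the field names are hypothetical, the constants are local copies of the values in the Syntax section, and treating the absence of IFILTER_INIT_INDEXING_ONLY as "keep chunk state" is one possible interpretation rather than documented behavior.

```cpp
#include <cstdint>

namespace sketch {

// Values copied from the Syntax section; real code would use Filter.h.
constexpr std::uint32_t kIndexingOnly       = 64;
constexpr std::uint32_t kSearchLinks        = 128;
constexpr std::uint32_t kFilterOwnedValueOk = 512;

// Hypothetical settings an implementation might cache inside Init.
struct FilterOptions {
    bool keepChunkState;      // false when the client uses the filter for indexing only
    bool followLinks;         // true when linked objects must be searched recursively
    bool mayEmitOwnedValues;  // true when filter-generated property values are allowed
};

// Translates the raw flags value into named options.
constexpr FilterOptions DecodeOptions(std::uint32_t grfFlags) {
    return FilterOptions{
        (grfFlags & kIndexingOnly) == 0,
        (grfFlags & kSearchLinks) != 0,
        (grfFlags & kFilterOwnedValueOk) != 0,
    };
}

}  // namespace sketch
```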

Requirements

Header

Filter.h