Text Pre-Extraction in AEM: Comprehensive Guide

Text pre-extraction in AEM is highly recommended when indexing or re-indexing Lucene indexes on repositories with large binaries that contain extractable text (e.g. PDFs, Word docs, PPTs, TXT files). Re-indexing Lucene indexes directly is very expensive and may cause performance issues. After completing this tutorial you will understand:-

  • Text pre-extraction overview.
  • When to use text pre-extraction in AEM.
  • When not to use text pre-extraction in AEM.
  • Prerequisites for using text pre-extraction.
  • Execute text pre-extraction.
  • Validate OAK Index Consistency.

Text Pre-extraction Overview

Text pre-extraction is the process of extracting and processing text from binaries that contain extractable text (e.g. PDFs, Word docs, PPTs, TXT files). Extracting text from binaries is an expensive operation and slows down the indexing rate considerably. Lucene indexing is performed in single-threaded mode.

For incremental indexing this mostly works fine, but when re-indexing or creating an index for the first time after a migration, it increases the indexing time considerably. To speed up such cases, Oak supports pre-extracting text from binaries so that no text has to be extracted at indexing time. This feature consists of two main steps:-

  1. Extract and store the extracted text from binaries using oak-run tooling.
  2. Configure the Oak runtime to use the extracted text at indexing time via PreExtractedTextProvider.

  • Oak text pre-extraction is recommended for indexing or re-indexing Lucene indexes on repositories with large volumes of files (binaries) that contain extractable text (e.g. PDFs, Word docs, PPTs, TXT files) and that qualify for full-text search via deployed Oak indexes; for example /oak:index/damAssetLucene.
  • Text pre-extraction only benefits the re/indexing of Lucene indexes, and NOT Oak property indexes, since property indexes do not extract text from binaries.
  • Text pre-extraction has a high positive impact on the full-text re-indexing of text-heavy binaries (PDF, Doc, TXT, etc.), whereas a repository of images will not see the same gains, since images do not contain extractable text.
  • Text pre-extraction extracts the text needed for full-text search efficiently and exposes it to the Oak re/indexing process in a form that is cheap to consume.

When to use Text Pre-extraction

Re-indexing an existing Lucene index with binary extraction enabled

  • Re-indexing processes all candidate content in the repository. When the binaries to extract full-text from are numerous or complex, performing the full-text extraction places an increased computational burden on AEM. Text pre-extraction moves the “computationally costly work” of text extraction into an isolated process that directly accesses AEM’s Data Store, avoiding overhead and resource contention in AEM.

Supporting the deployment of a new Lucene index to AEM with binary extraction enabled

  • When a new index (with binary extraction enabled) is deployed into AEM, Oak automatically indexes all candidate content on the next async full-text index run. For the same reasons described in re-indexing above, this may result in undue load on AEM.

When to avoid text pre-extraction

Text pre-extraction cannot be used for new content added to the repository, nor is it necessary.

New content added to the repository is naturally and incrementally indexed by the async full-text indexing process (by default, every 5 seconds).

Under normal operation of AEM, for example uploading Assets via the Web UI or programmatic ingest of Assets, AEM automatically and incrementally full-text indexes the new binary content. Since the amount of data is incremental and relatively small (approximately the amount of data that can be persisted to the repository in 5 seconds), AEM can perform the full-text extraction from the binaries during indexing without affecting overall system performance.

Prerequisites for using text pre-extraction

Below are the prerequisites for running text pre-extraction:-

  • A maintenance window to generate the CSV file AND to perform the final re-indexing.
  • The text pre-extraction OSGi config requires a file system path to the extracted text files, so they must be accessible directly from the AEM instance (local drive or file share mount). We will look at this in detail when we configure PreExtractedTextProvider.

Execute Text Pre-extraction

Before proceeding with text extraction, let's take a look at the text extraction architecture, which shows how text extraction from binaries is achieved in AEM.

[Figure: text pre-extraction architecture]

In this tutorial, I will show the pre-extraction commands for both Windows and Linux. Please run the commands that match your OS.

Step 1 :- Set Up (this is the same for both Windows and Linux)

Navigate to the folder where you want to create the CSV file. For example, on Windows I am going to use the path C:\Ankur\Learning\Text Preextraction and on Linux /mnt/preextraction.

You need the correct version of oak-run.jar. AEM 6.4 uses Oak 1.8.x and AEM 6.5 uses Oak 1.10.x; 6.5.5 seems to be on 1.22.3. When in doubt, check the version of the org.apache.jackrabbit.oak-core bundle.

[Screenshot: checking the version of the oak-core bundle]

Run the command below to download the oak-run jar, adjusting the version to match your AEM version.

wget https://repo1.maven.org/maven2/org/apache/jackrabbit/oak-run/1.10.2/oak-run-1.10.2.jar
[Screenshot: downloading the oak-run jar]

Note:- If you don’t have wget on Windows, please download it and add it to the system path.
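
If you need a different oak-run version, the download URL follows the usual Maven Central layout, so it can be derived from the version number. Below is a minimal sketch; the function name oak_run_url is my own, and the only assumption is that your version is published under the same path pattern as the 1.10.2 URL above.

```shell
# Build the Maven Central download URL for a given oak-run version.
# oak_run_url is a hypothetical helper name, not part of any tool.
oak_run_url() {
  version="$1"
  printf 'https://repo1.maven.org/maven2/org/apache/jackrabbit/oak-run/%s/oak-run-%s.jar\n' \
    "$version" "$version"
}

# Usage (uncomment to download the jar that matches your oak-core bundle):
# wget "$(oak_run_url 1.10.2)"
```

Pass the version that you saw on the org.apache.jackrabbit.oak-core bundle.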

The oak-run tool provides a tika command, which traverses the repository and extracts text from binary properties. It needs the Tika app jar (tika-app-1.25.jar), so run the command below to download it.

wget https://repo1.maven.org/maven2/org/apache/tika/tika-app/1.25/tika-app-1.25.jar
[Screenshot: downloading the Tika app jar]

Step 2 :- Generate CSV file

In this step we generate a CSV file that contains details about every binary property. The file is generated with the tika command from oak-run, which connects to the repository in read-only mode.

Note:- Generating the CSV file scans the whole repository, so run this step when the system is not in active use, preferably during a maintenance window; otherwise files uploaded during the scan might be missing from the CSV.

If your binaries are stored in the AEM repository, use the command below to generate the CSV file.

// Command for Windows, assuming AEM is installed in C:\Ankur\Projects\AEM\AEM Plain Vanilla- no extra code

java -jar "C:\Ankur\Learning\Text Preextraction\oak-run-1.10.2.jar" tika --fds-path="C:\Ankur\Projects\AEM\AEM Plain Vanilla- no extra code\crx-quickstart\repository\datastore" "C:\Ankur\Projects\AEM\AEM Plain Vanilla- no extra code\crx-quickstart\repository\segmentstore" --data-file oak-binary-stats.csv --generate 

Note:- I am enclosing the file paths in double quotes because my paths contain spaces. If your paths have no spaces you can remove the double quotes, but keeping them does not affect the command.

// Command for unix, assuming AEM is installed in /mnt/crx/author
java -jar $(pwd)/oak-run-1.10.2.jar tika --fds-path=/mnt/crx/author/crx-quickstart/repository/datastore /mnt/crx/author/crx-quickstart/repository/segmentstore --data-file oak-binary-stats.csv --generate 

If your binaries are stored on S3, use the command below to generate the CSV file.

// Command for Windows, assuming AEM is installed in C:\Ankur\Projects\AEM\AEM Plain Vanilla- no extra code
java -jar "C:\Ankur\Learning\Text Preextraction\oak-run-1.10.2.jar" tika --fake-ds-path=temp "C:\Ankur\Projects\AEM\AEM Plain Vanilla- no extra code\crx-quickstart\repository\segmentstore" --data-file oak-binary-stats.csv --generate
// Command for unix, assuming AEM is installed in /mnt/crx/author
java -jar $(pwd)/oak-run-1.10.2.jar tika --fake-ds-path=temp /mnt/crx/author/crx-quickstart/repository/segmentstore --data-file oak-binary-stats.csv --generate

After executing the above command, you should see a message like the one below.

[Screenshot: console output of the CSV generation]
[Screenshot: sample of the generated CSV file]
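
Before moving on, it can help to check how much extractable content the scan actually found. The sketch below summarises the CSV by MIME type. It assumes the column order blobId, length, jcr:mimeType, jcr:encoding, path (which is how I understand the oak-run output; please compare with your own file) and that the first three columns contain no embedded commas. The function name summarize_by_mime is my own.

```shell
# Print "mimeType,count,totalBytes" for each MIME type in a binary stats CSV.
# Assumed column order: blobId,length,jcr:mimeType,jcr:encoding,path
summarize_by_mime() {
  awk -F',' '
    NR == 1 && $1 == "blobId" { next }   # skip a header row if present
    NF < 3 { next }                      # ignore blank or malformed lines
    {
      mime = $3
      gsub(/"/, "", mime)                # drop the surrounding quotes
      if (mime == "") mime = "(none)"
      count[mime]++
      bytes[mime] += $2
    }
    END {
      for (m in count) printf "%s,%d,%.0f\n", m, count[m], bytes[m]
    }
  ' "$1" | LC_ALL=C sort
}

# Usage:
# summarize_by_mime oak-binary-stats.csv
```

If most of the entries are images, pre-extraction will probably not save much time, since images do not contain extractable text.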

Note:- By default the command scans the whole repository. If you need to restrict it to a certain path, specify that path via the --path option.
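
As a rough sketch, a restricted scan could be wrapped like this. The function name and the /content/dam value in the usage line are only examples of mine; the other arguments are the same ones used in the unix command above, and I am assuming --path accepts its value with = just like --fds-path does.

```shell
# Generate the binary stats CSV for a single subtree of the repository.
# generate_csv_for_path is a hypothetical wrapper, not an oak-run feature.
generate_csv_for_path() {
  oak_run_jar="$1"    # path to the oak-run jar
  datastore="$2"      # FileDataStore directory
  segmentstore="$3"   # segment store directory
  repo_path="$4"      # repository path to scan
  java -jar "$oak_run_jar" tika --fds-path="$datastore" "$segmentstore" \
    --data-file oak-binary-stats.csv --path="$repo_path" --generate
}

# Usage:
# generate_csv_for_path "$(pwd)/oak-run-1.10.2.jar" \
#   /mnt/crx/author/crx-quickstart/repository/datastore \
#   /mnt/crx/author/crx-quickstart/repository/segmentstore \
#   /content/dam
```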

Execute Text Extraction

Once the CSV file is generated, we need to perform the text extraction. Currently the extracted text is stored as one file per blob, in the same format that FileDataStore uses. In addition, the extraction creates 2 files:-

  • blobs_error.txt – File containing blobIds for which text extraction ended in error
  • blobs_empty.txt – File containing blobIds for which no text was extracted

Note:- This phase is incremental, i.e. if it is run multiple times with the same --store-path, it skips binaries that were already processed.
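
After an extraction run, a quick look at these two files tells you how many binaries failed or produced no text. Here is a small sketch, assuming both files sit directly inside the directory given as --store-path and contain one blobId per line (please check this against your own store). The function name extraction_report is my own.

```shell
# Count the blobIds recorded in blobs_error.txt and blobs_empty.txt.
# A missing file is reported as 0.
extraction_report() {
  store_dir="$1"
  for name in blobs_error.txt blobs_empty.txt; do
    file="$store_dir/$name"
    if [ -f "$file" ]; then
      # Count non-blank lines; "n + 0" prints 0 for an empty file.
      count=$(awk 'NF { n++ } END { print n + 0 }' "$file")
    else
      count=0
    fi
    printf '%s: %s\n' "$name" "$count"
  done
}

# Usage:
# extraction_report /mnt/preextraction/store
```

A large number in blobs_error.txt is worth investigating before you re-index, because those binaries still have to be extracted at indexing time.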

There are 2 ways of doing this :-

  1. Do text extraction using tika
  2. Use a suitable lucene index to get text extraction data from index itself which would have been generated earlier

Personally I prefer the second way, using the Lucene index that I need to re-index. But let’s look at both approaches in detail below.

Text Extraction using Tika

To use Tika for text extraction, use the extract command. I am using Windows, so the screenshots are from Windows.

// Command for Windows when running on a single system; to run from multiple systems use -cp as in the unix command below (on Windows the classpath separator is ; instead of :)

java -jar oak-run-1.10.2.jar tika --data-file "C:\Ankur\Learning\Text Preextraction\oak-binary-stats.csv" --store-path "C:\Ankur\Learning\Text Preextraction\store" --fds-path "C:\Ankur\Projects\AEM\AEM Plain Vanilla- no extra code\crx-quickstart\repository\datastore" extract
// Command for Linux

java -cp oak-run-1.10.2.jar:tika-app-1.25.jar org.apache.jackrabbit.oak.run.Main tika --data-file $(pwd)/oak-binary-stats.csv --store-path $(pwd)/store --fds-path /mnt/crx/author/crx-quickstart/repository/datastore extract

This command does not require access to the NodeStore; it only requires access to the BlobStore. So configure whichever BlobStore is in use, such as FileDataStore or S3DataStore. The above command extracts text using multiple threads and stores the extracted text in the directory specified by --store-path.

[Screenshot: console output of the text extraction]

Note:- On Linux we launch the command with -cp instead of -jar because we need to include classes from outside the oak-run jar, such as tika-app. Also make sure that oak-run comes before tika-app in the classpath. This is required because tika-app packages some old classes.
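
To avoid getting the classpath order wrong, the Linux command can be wrapped in a small function that always puts oak-run first and checks that the input files exist. This is only a sketch; the function name extract_text and its argument order are my own choices.

```shell
# Run the tika extract command with oak-run ahead of tika-app on the classpath.
extract_text() {
  oak_run_jar="$1"   # path to the oak-run jar
  tika_jar="$2"      # path to the tika-app jar
  csv_file="$3"      # CSV generated in Step 2
  store_dir="$4"     # directory for the extracted text (--store-path)
  fds_dir="$5"       # FileDataStore directory (--fds-path)
  for f in "$oak_run_jar" "$tika_jar" "$csv_file"; do
    [ -f "$f" ] || { echo "missing file: $f" >&2; return 1; }
  done
  # oak-run must come first so its classes win over the older ones in tika-app.
  java -cp "$oak_run_jar:$tika_jar" org.apache.jackrabbit.oak.run.Main tika \
    --data-file "$csv_file" --store-path "$store_dir" --fds-path "$fds_dir" extract
}

# Usage:
# extract_text oak-run-1.10.2.jar tika-app-1.25.jar "$(pwd)/oak-binary-stats.csv" \
#   "$(pwd)/store" /mnt/crx/author/crx-quickstart/repository/datastore
```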

Populate text extraction store using index

This approach is good if you have to extract text for specific indexes. Please update the path of your Lucene index in the command accordingly, as shown below:-

// Command for windows
java -jar oak-run-1.10.2.jar tika --data-file "C:\Ankur\Learning\Text Preextraction\oak-binary-stats.csv" --store-path "C:\Ankur\Learning\Text Preextraction\store"  --index-dir "C:\Ankur\Projects\AEM\AEM Plain Vanilla- no extra code\crx-quickstart\repository\index\lucene-1614678654098\data" "C:\Ankur\Projects\AEM\AEM Plain Vanilla- no extra code\crx-quickstart\repository\index\ntBaseLucene-1614678654199\data" populate
// Command for unix
find /mnt/crx/author/crx-quickstart/repository/index -name data -type d -print -exec java -jar $(pwd)/oak-run-1.10.2.jar tika --data-file oak-binary-stats.csv --store-path /mnt/preextraction/store --index-dir {} populate \;
[Screenshot: console output of populating the text store from an index]

NOTE: This is very important, because getting it wrong can populate the text extraction store incorrectly. Make sure that no useful binaries are added to the repository between the step that dumped the index data and the step that generated the binary stats CSV.

Updating PreExtractedTextProvider OSGI Configuration

This configuration makes sure that Oak uses the pre-extracted text for indexing. Navigate to Felix console -> Configuration and search for Apache Jackrabbit Oak DataStore PreExtractedTextProvider. Set its path to the directory that you passed as --store-path.

[Screenshot: PreExtractedTextProvider configuration with a path to the Windows file system]
[Screenshot: PreExtractedTextProvider configuration with a path to the Linux file system]

Note:- Once PreExtractedTextProvider is configured, the Lucene indexer uses it during re-indexing to check whether text still needs to be extracted. Check TextExtractionStatsMBean for various statistics about text extraction and to validate that PreExtractedTextProvider is being used.
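
If you prefer a file over the Felix console, the same setting can probably be deployed as an OSGi .config file in crx-quickstart/install. The sketch below is for Linux paths only and rests on two assumptions that you should verify in your Felix console first: that the configuration PID is org.apache.jackrabbit.oak.plugins.index.datastore.DataStoreTextProviderService and that the path property is named dir. The function name is my own.

```shell
# Write an OSGi .config file that points PreExtractedTextProvider at the store.
# ASSUMPTION: the PID and the "dir" property name below are taken from memory
# of the Oak sources; confirm both in the Felix console before relying on them.
write_text_provider_config() {
  install_dir="$1"   # e.g. /mnt/crx/author/crx-quickstart/install
  store_dir="$2"     # directory that was used as --store-path
  pid="org.apache.jackrabbit.oak.plugins.index.datastore.DataStoreTextProviderService"
  mkdir -p "$install_dir"
  printf 'dir="%s"\n' "$store_dir" > "$install_dir/$pid.config"
}

# Usage:
# write_text_provider_config /mnt/crx/author/crx-quickstart/install /mnt/preextraction/store
```

Windows paths would need their backslashes escaped in a .config file, so on Windows I would stick to the Felix console.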

Run OAK Re-Indexing

Navigate to CRXDE Lite (crx/de), select the Lucene index that you want to re-index, and set its reindex boolean property to true. For example, in the screenshot below I am re-indexing the damAssetLucene index.

[Screenshot: setting reindex to true on the damAssetLucene index]

Once re-indexing is completed, this flag is automatically set back to false and reindexCount increases by 1.
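
The same flag can also be set from the command line through the Sling POST servlet, which is handy when you script a maintenance window. This is a sketch under a few assumptions: the host, index path and credentials in the usage line are local development examples, the function name is mine, and your instance must allow POST requests to /oak:index for that user.

```shell
# Set reindex=true (as a Boolean) on an Oak index node via the Sling POST servlet.
trigger_reindex() {
  host="$1"          # e.g. http://localhost:4502
  index_path="$2"    # e.g. /oak:index/damAssetLucene
  credentials="$3"   # user:password
  curl -sS -u "$credentials" \
    -F "reindex=true" -F "reindex@TypeHint=Boolean" \
    "$host$index_path"
}

# Usage (local development instance with default credentials):
# trigger_reindex http://localhost:4502 /oak:index/damAssetLucene admin:admin
```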

Validate OAK Index Consistency

You can check whether an index is still valid or has become corrupted and needs re-indexing.

Using Touch UI:-

Navigate to Dashboard -> Tools -> Operations -> Diagnosis -> Index Manager (http://localhost:4502/libs/granite/operations/content/diagnosistools/indexManager.html)

Select the index that you want to validate and click on Consistency check, as shown in the screenshot below.

[Screenshot: Consistency check in the Index Manager]

Using Felix Console:-

Navigate to Felix Console -> Status -> Oak Index Stats (http://localhost:4502/system/console/status-oak-index-stats) and check whether any index shows its status as corrupt.

[Screenshot: Oak Index Stats page in the Felix console]

Hope you are able to understand Text Pre-extraction of binary assets in AEM. Feel free to drop a comment and let me know if you face any issues.