Strawberry Runners Post-Processing Configuration
Archipelago's Strawberry Runners (SBR) module provides a set of post-processing capabilities for the JSON-based metadata, files, and entities that comprise your Archipelago Digital Objects (ADOs). These post-processing actions are triggered by dispatched events, direct HTTP calls, and webhooks invoked by partner services (such as Min.io or AWS S3) or self-invoked.
The default Archipelago SBR post-processor configurations include operations that:
- perform page-based HOCR/OCR for image- and PDF-based ADOs, send the output to the Search API, and use Natural Language Processing to extract entities from the output
- extract text from pages within a Webarchives File and send the output to the Search API
- convert WARC format Webarchives Files into WACZ format and attach the new WACZ file to the original source ADO to complement the WARC original
- extract textual values from subtitle/transcript VTT files and generate time/space transmuted OCR
- extract text from Files and send the output to the Search API
SBR actions can be chained and nested to enable ordered operations, such as first extracting individual pages in an ordered sequence and then running HOCR/OCR across those individual pages.
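The ordering that nesting implies can be sketched in a few lines of Python. This is only a conceptual model, not Strawberry Runners code: the `Processor` class and `execution_order` function are hypothetical names introduced for illustration, and the sketch assumes nothing more than that a nested action runs after the action it is nested under.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    """Hypothetical stand-in for one configured post-processor action."""
    name: str
    children: list["Processor"] = field(default_factory=list)


def execution_order(roots: list[Processor]) -> list[tuple[int, str]]:
    """Return (depth, name) pairs, listing every parent before its children."""
    order: list[tuple[int, str]] = []

    def visit(node: Processor, depth: int) -> None:
        order.append((depth, node.name))
        for child in node.children:
            visit(child, depth + 1)

    for root in roots:
        visit(root, 0)
    return order


# Illustrative chain: extract ordered pages first, then run OCR on each page.
chain = [Processor("pager", [Processor("ocr")])]

for depth, name in execution_order(chain):
    print("  " * depth + name)
```

Modeling the configuration as a tree keeps the rule simple: a child action is never scheduled until its parent has produced the output it depends on.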
Important Note: Local vs Live/Production Instances
The Local Archipelago Deployment features a few additional post-processor operations related to the Experimental ML Tools. Please refer to that documentation for more information about those additional post-processors and their usage.
This guide only covers the primary Strawberry Runners Post-Processors.
Strawberry Runners Settings Overview
You can access the Strawberry Runners Settings:
- Through the Manage menu > Configuration > Archipelago > Configure Strawberry Runners Post Processors
- Directly at /admin/config/archipelago/strawberry_runners
On the Strawberry Runners Settings page, you will see the Archipelago default post processor configurations (unless modified).
- The pager action uses the 'Post processor that extracts/generates Ordered Sequences of files/pages/children using Files present in an ADO' plugin.
  - Nested one level in, the ocr action uses the 'Post processor that Runs OCR/HORC against files' plugin. The ocr operations will be executed after the completion of the pager operations.
- The wacz_page_extractor action uses the 'Post processor that extracts/generates Indexed Page Content from WACZ files in an ADO' plugin.
  - Nested one level in, the webpage action uses the 'Post processor that Indexes WACZ Frictionless data Search Index to Search API' plugin. The webpage operations will be executed after the completion of the wacz_page_extractor operations.
- The warc_to_wacz action uses the 'Post processor that uses a System Binary to process * files' plugin.
- The subtitle action extracts textual values from subtitle/transcript VTT files and generates time/space transmuted OCR. This transmuted OCR can be used to search within the subtitle/transcript VTT file(s) of a time-based video or audio file, then navigate to the matching time of that video or audio file within a media viewer.
- The text action extracts text from Files.
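To make the idea behind the subtitle action more concrete, the following Python sketch shows how text in a WebVTT file can be tied back to playback times. It is not the plugin's implementation and does not reproduce the transmuted OCR format. The function names (`parse_cues`, `find_times`) and the sample transcript are assumptions made purely for illustration.

```python
import re

# Matches a WebVTT cue timing line such as "00:00:01.000 --> 00:00:04.000".
# The hours component is optional in WebVTT.
CUE_TIMING = re.compile(
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+"
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
)


def to_seconds(hours, minutes, seconds, millis):
    """Convert the captured timestamp parts to seconds as a float."""
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_cues(vtt_text):
    """Return a list of (start_seconds, end_seconds, text) tuples."""
    cues = []
    # Cues are separated by blank lines; blocks without a timing line are skipped.
    for block in re.split(r"\n\s*\n", vtt_text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = CUE_TIMING.match(line.strip())
            if match:
                parts = match.groups()
                text = " ".join(item.strip() for item in lines[i + 1:])
                cues.append((to_seconds(*parts[:4]), to_seconds(*parts[4:]), text))
                break
    return cues


def find_times(cues, term):
    """Return the start times of cues whose text contains the search term."""
    return [start for start, _end, text in cues if term.lower() in text.lower()]


# Hypothetical sample transcript used only for this example.
SAMPLE = """WEBVTT

00:00:01.000 --> 00:00:04.000
Welcome to the archive.

00:00:05.500 --> 00:00:08.000
This recording covers digital objects.
"""

cues = parse_cues(SAMPLE)
print(find_times(cues, "digital"))
```

A media viewer could use a start time returned by `find_times` as the seek position for the matching video or audio file.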
Reviewing and Adjusting the default Post-Processors
From the main Strawberry Runners Settings page, you can review and adjust the settings for the default Archipelago configurations by selecting Edit from the Operations menu.
Please see the following guides for:
- Adjusting the pager and ocr operations
- Adjusting the wacz_page_extractor and webpage operations
- Adjusting the warc_to_wacz operation
- Adjusting the subtitle operation
- Adjusting the text operation
Triggering Post-Processing Actions Manually
After making adjustments to Strawberry Runners Post-Processing configurations, you may want to trigger/re-trigger a particular action manually.
You can use Archipelago's Find and Replace to first select the specific group of Digital Objects you wish to target for Post-Processing, then select the 'Trigger Strawberry Runners process/reprocess for Archipelago Digital Objects content item' action from the Find and Replace Actions menu.
For Archipelago 1.5.0 and up, you can also run the same Action from the 'Run Action on Processed ADOs' tab.
Additional Post Processor Operations
Archipelago also includes the Post processor that writes/reads Frictionless Data Packages plugin. Please keep a lookout for future documentation related to using this plugin.
