Adding Metadata and Full-Text Indexing to Rich Documents and Assets in CrafterCMS
Russ Danner
In many use cases and experiences, we want to treat specific types of assets, such as rich documents, videos, and high-resolution images, as first-class content objects in the CMS, with their own custom metadata and with their in-file metadata indexed. CrafterCMS lets you "jacket" an asset with a content type to support these scenarios.
In this video blog we cover:
- What is a document/asset jacket
- How document full-text and custom metadata indexing works
- How to configure your Crafter Studio project and Deployer to support custom metadata for rich documents
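To make the jacket idea concrete, here is a minimal Python sketch that reads a made-up jacket descriptor and pulls out its custom metadata and the binary it wraps. The element names content-type, internal-name and item/key mirror names used in the configuration later in this post; the root element, the summary_t and asset_o fields, and all of the values are assumptions for illustration only, not taken from a real project.

```python
import xml.etree.ElementTree as ET

# Hypothetical jacket descriptor, e.g. stored under /site/documents/.
# Only content-type, internal-name and item/key echo names from the
# configuration in this post; everything else is invented.
JACKET_XML = """\
<component>
  <content-type>/component/document</content-type>
  <internal-name>Annual Report 2023</internal-name>
  <summary_t>Financial results and outlook.</summary_t>
  <asset_o>
    <item>
      <key>/static-assets/documents/annual-report-2023.pdf</key>
    </item>
  </asset_o>
</component>
"""


def read_jacket(xml_text):
    """Return the jacket's metadata plus the paths of the binaries it references."""
    root = ET.fromstring(xml_text)
    return {
        "content_type": root.findtext("content-type"),
        "internal_name": root.findtext("internal-name"),
        "summary": root.findtext("summary_t"),
        # ElementTree spells the document-wide XPath //item/key as .//item/key
        "binaries": [el.text for el in root.findall(".//item/key")],
    }


jacket = read_jacket(JACKET_XML)
print(jacket["internal_name"], "->", jacket["binaries"])
```

The point of the sketch is the shape of the data: the jacket carries the searchable metadata, and a reference inside it points at the actual file.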
Important Documentation Links:
- Deployer Documentation: https://docs.craftercms.org/current/reference/modules/deployer/index.html
- Default Deployer Target Configuration: https://github.com/craftercms/deployer/blob/develop/src/main/resources/base-target.yaml
- Permission Configuration: https://docs.craftercms.org/current/reference/modules/studio/administration.html#roles-and-permissions
- Studio Configuration: https://docs.craftercms.org/current/reference/modules/studio/configuration/project-configuration.html#project-configuration
- UI Configuration: https://docs.craftercms.org/current/by-role/project-admin/index.html#ui
Sample Configuration:
Below you will find the configuration examples covered in the video.
Project Config (site-config.xml):
<folders>
  <folder name="Pages" path="/website" read-direct-children="false" attach-root-prefix="true"/>
  <folder name="Components" path="/components" read-direct-children="false" attach-root-prefix="true"/>
  <folder name="Documents" path="/documents" read-direct-children="false" attach-root-prefix="true"/>
  <folder name="Taxonomy" path="/taxonomy" read-direct-children="false" attach-root-prefix="true"/>
  <folder name="Assets" path="/static-assets" read-direct-children="false" attach-root-prefix="false"/>
  <folder name="Templates" path="/templates" read-direct-children="false" attach-root-prefix="false"/>
  <folder name="Scripts" path="/scripts" read-direct-children="false" attach-root-prefix="false"/>
</folders>
...
<pattern-group name="component">
  <pattern>/site/components/([^<]+)\.xml</pattern>
  <pattern>/site/documents/([^<]+)\.xml</pattern>
  <pattern>/site/system/page-components/([^<]+)\.xml</pattern>
  <pattern>/site/component-bindings/([^<]+)\.xml</pattern>
  <pattern>/site/indexes/([^<]+)\.xml</pattern>
  <pattern>/site/resources/([^<]+)\.xml</pattern>
</pattern-group>
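As a quick sanity check on the component pattern group, the sketch below runs a few paths through the same regular expressions in Python. It assumes each pattern must match the whole path, which may differ from how Studio evaluates them, and the sample paths are made up.

```python
import re

# Patterns copied from the "component" pattern-group above.
COMPONENT_PATTERNS = [
    r"/site/components/([^<]+)\.xml",
    r"/site/documents/([^<]+)\.xml",
    r"/site/system/page-components/([^<]+)\.xml",
    r"/site/component-bindings/([^<]+)\.xml",
    r"/site/indexes/([^<]+)\.xml",
    r"/site/resources/([^<]+)\.xml",
]


def is_component(path):
    """True if the path matches any component pattern (whole-path match assumed)."""
    return any(re.fullmatch(pattern, path) for pattern in COMPONENT_PATTERNS)


# Hypothetical paths, for illustration only.
for path in [
    "/site/documents/reports/annual-report.xml",
    "/site/website/index.xml",
    "/static-assets/documents/annual-report.pdf",
]:
    print(path, is_component(path))
```

Under that assumption, the jacket descriptor under /site/documents is classified as a component, while the page and the PDF itself are not.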
Permissions Config (permissions.xml):
<permissions>
  <version>4.1.2</version>
  <role name="author">
    <rule regex="/site/website/.*">
      <allowed-permissions>
        <permission>content_read</permission>
        <permission>content_write</permission>
        <permission>content_create</permission>
        <permission>folder_create</permission>
        <permission>get_children</permission>
        <permission>content_copy</permission>
      </allowed-permissions>
    </rule>
    <rule regex="/site/components|/site/components/.*">
      <allowed-permissions>
        <permission>content_read</permission>
        <permission>content_write</permission>
        <permission>content_create</permission>
        <permission>folder_create</permission>
        <permission>get_children</permission>
        <permission>content_copy</permission>
      </allowed-permissions>
    </rule>
Target YAML:
binary:
  # The list of binary file mime types that should be indexed
  supportedMimeTypes:
    - application/pdf
    - application/msword
    - application/vnd.openxmlformats-officedocument.wordprocessingml.document
    - application/vnd.ms-excel
    - application/vnd.ms-powerpoint
    - application/vnd.openxmlformats-officedocument.presentationml.presentation
  # The regex path patterns for the metadata ("jacket") files of binary/document files
  metadataPathPatterns:
    - ^/?site/documents/.+\.xml$
  # The regex path patterns for binary/document files that are stored remotely
  remoteBinaryPathPatterns: &remoteBinaryPathPatterns
    # HTTP/HTTPS URLs are only indexed if they contain the protocol (http:// or https://). Protocol relative
    # URLs (like //mydoc.pdf) are not supported since the protocol is unknown to the back-end indexer.
    - ^(http:|https:)//.+$
    - ^/remote-assets/.+$
  # The regex path patterns for binary/document files that should be associated to just one metadata file and are
  # dependent on that parent metadata file, so if the parent is deleted the binary should be deleted from the index
  childBinaryPathPatterns: *remoteBinaryPathPatterns
  # The XPaths of the binary references in the metadata files
  referenceXPaths:
    - //item/key
    - //item/url
  # Settings specific to authoring indexes
  authoring:
    # XPath for the internal name field
    internalName:
      xpath: '*/internal-name'
      includePatterns:
        - ^/?site/.+$
        - ^/?static-assets/.+$
        - ^/?remote-assets/.+$
        - ^/?scripts/.+$
        - ^/?templates/.+$
    contentType:
      xpath: '*/content-type'
    # Same as for delivery but include images and videos
    supportedMimeTypes:
      - application/pdf
      - application/msword
      - application/vnd.openxmlformats-officedocument.wordprocessingml.document
      - application/vnd.ms-excel
      - application/vnd.ms-powerpoint
      - application/vnd.openxmlformats-officedocument.presentationml.presentation
      - application/x-subrip
      - image/*
      - video/*
      - audio/*
      - text/x-freemarker
      - text/x-groovy
      - text/javascript
      - text/css
    # The regex path patterns for the metadata ("jacket") files of binary/document files
    metadataPathPatterns:
      - ^/?site/documents/.+\.xml$
    binaryPathPatterns:
      - ^/?static-assets/.+$
      - ^/?remote-assets/.+$
      - ^/?scripts/.+$
      - ^/?templates/.+$
    # Look into all XML descriptors to index all binary files referenced
    binarySearchablePathPatterns:
      - ^/?site/.+\.xml$
    # Additional metadata such as contentLength, content-type specific metadata
    metadataExtractorPathPatterns:
      - ^/?site/.+$
    excludePathPatterns:
      - ^/?config/.*$
    # Include all fields marked as remote resources (S3, Box, CMIS)
    referenceXPaths:
      - //item/key
      - //item/url
      - //*[@remote="true"]
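To get a feel for what the path patterns in the target configuration select, here is a small Python sketch that applies a few of them to sample paths. The patterns are copied from the YAML above; the classify helper, its labels, the order in which it checks the pattern lists, and the sample paths are all assumptions for illustration and do not reflect how the Deployer actually sequences its processors.

```python
import re

# Patterns copied from the target YAML above.
METADATA_PATH_PATTERNS = [r"^/?site/documents/.+\.xml$"]
REMOTE_BINARY_PATH_PATTERNS = [r"^(http:|https:)//.+$", r"^/remote-assets/.+$"]
AUTHORING_BINARY_PATH_PATTERNS = [
    r"^/?static-assets/.+$",
    r"^/?remote-assets/.+$",
    r"^/?scripts/.+$",
    r"^/?templates/.+$",
]
EXCLUDE_PATH_PATTERNS = [r"^/?config/.*$"]


def matches_any(path, patterns):
    """True if at least one regex in the list matches the path."""
    return any(re.search(pattern, path) for pattern in patterns)


def classify(path):
    """Label a path by the first pattern list it matches (order is an assumption)."""
    if matches_any(path, EXCLUDE_PATH_PATTERNS):
        return "excluded"
    if matches_any(path, METADATA_PATH_PATTERNS):
        return "jacket"
    if matches_any(path, REMOTE_BINARY_PATH_PATTERNS):
        return "remote-binary"
    if matches_any(path, AUTHORING_BINARY_PATH_PATTERNS):
        return "binary"
    return "other"


# Hypothetical paths, for illustration only.
for path in [
    "/site/documents/reports/q1.xml",
    "/static-assets/documents/q1.pdf",
    "https://example.com/docs/q1.pdf",
    "//mydoc.pdf",
    "/config/studio/site-config.xml",
]:
    print(f"{path} -> {classify(path)}")
```

Note that the protocol-relative URL falls through to "other", which matches the comment in the configuration: without http:// or https:// the remote pattern does not apply.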