Paperless-NGX on Docker
May 8, 2024
This should be easy, since Paperless-ngx very wisely provides pre-made Docker images for you. The problem is that they aren’t complete unless you’re running an absolutely bog-standard, not-so-secure setup. There are others out there who have published their own experiences and I’ll add to the pile, since I didn’t find what I was looking for in one place. I deployed the container about 12 times before I got it completely working.
So, just to be perfectly clear, Paperless-ngx is quite a cool piece of software AND it’s the best self-hosted OCR I’ve seen (having tried a number of others). It doesn’t do all the things I want, but it does a whole bunch more that I won’t be using for my use case. It is a SCANNED document management system. That is to say, feed it images or PDFs and it will OCR them, tag them, and organize them based on automation or your own cleverness, depending on how you configure it and on how large and how good your pile of documents is as training fodder for the automation.
It is NOT a media management system; don’t feed it pictures or movies or anything like that (see Jellyfin or Plex). With the Tika extension it can be used to manage office documents (.docx, .doc, .odt, .ppt, .pptx, .odp, .xls, .xlsx, .ods), the advantage being the automated addition of metadata to your documents. I’m not sure this is a use case I would pursue myself, as there are other systems out there (admittedly much more expensive and fussy) for this and frankly it’s overkill for my needs (not that that stopped me from installing Tika to see how it worked…)
My specific use case is simple: I hate working with receipts for taxes. I’m a very tiny business (contractor) with not so many receipts, but I still lose them and hate writing it all down at tax time. I’m hoping to use this system to make that task less painful.
I’ll share my docker-compose.yml here (I don’t generally bother with separate env files as I’m interested in keeping things simple) and explain why I’ve set it up this way.
services:
  broker:
    image: docker.io/library/redis:7
    restart: unless-stopped
    networks:
      - paperless
    volumes:
      - ./redisdata:/data
  db:
    image: docker.io/library/postgres:15
    restart: unless-stopped
    volumes:
      - ./pgdata:/var/lib/postgresql/data
    networks:
      - paperless
    environment:
      PAPERLESS_TIME_ZONE: America/Los_Angeles
      POSTGRES_DB: paperless
      POSTGRES_USER: <postgresql-username>
      POSTGRES_PASSWORD: <postgresql-password>
  gotenberg:
    image: docker.io/gotenberg/gotenberg:7.10
    restart: unless-stopped
    networks:
      - paperless
    # The gotenberg chromium route is used to convert .eml files. We do not
    # want to allow external content like tracking pixels or even javascript.
    command:
      - "gotenberg"
      - "--chromium-disable-javascript=true"
      - "--chromium-allow-list=file:///tmp/.*"
  tika:
    image: ghcr.io/paperless-ngx/tika:latest
    restart: unless-stopped
    networks:
      - paperless
  webserver:
    image: ghcr.io/paperless-ngx/paperless-ngx:latest
    restart: unless-stopped
    depends_on:
      - db
      - broker
      - gotenberg
      - tika
    ports:
      - "8000:8000"
    volumes:
      - ./data:/usr/src/paperless/data
      - ./media:/usr/src/paperless/media
      - ./export:/usr/src/paperless/export
      - ./consume:/usr/src/paperless/consume
    networks:
      - proxy
      - paperless
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.paperless.rule=Host(`paperless.domain.com`)"
      - "traefik.http.routers.paperless.entrypoints=websecure"
      - "traefik.http.routers.paperless.tls=true"
      - "traefik.http.routers.paperless.tls.certresolver=mycertresolver"
      - "traefik.http.services.paperless.loadbalancer.server.port=8000"
      - "traefik.http.routers.paperless.service=paperless"
      - "traefik.http.middlewares.compresspaperless.compress=true"
      - "traefik.docker.network=proxy"
    environment:
      PAPERLESS_TIME_ZONE: America/Los_Angeles
      PAPERLESS_URL: https://paperless.domain.com
      PAPERLESS_SECRET_KEY: strong-key
      PAPERLESS_REDIS: redis://broker
      PAPERLESS_OCR_LANGUAGE: eng
      PAPERLESS_DBHOST: db
      PAPERLESS_TIKA_ENABLED: 1
      PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000
      PAPERLESS_TIKA_ENDPOINT: http://tika:9998
      PAPERLESS_DBUSER: <postgresql-username>
      PAPERLESS_DBPASS: <postgresql-password>
      PAPERLESS_ENABLE_COMPRESSION: false
networks:
  proxy:
    external: true
  paperless:
NOTES
- Networks: set networks: in ALL the containers… if you don’t set this to the same thing in every container being configured here, the containers can’t see each other, they can’t interact, and you can’t even set the superuser up, let alone get anything else working. Here I’ve set an internal network called paperless for all the containers… this is nice as it isolates them from all my other containers.
- PAPERLESS_DBUSER: <postgresql-username> and PAPERLESS_DBPASS: <postgresql-password> settings in the webserver. These are critical if you don’t want to leave these as defaults (which always makes me uncomfortable). The system will work perfectly to set up your superuser if you just set the password on the database, but then the web interface can’t run because it can’t connect to the database without the password. Definitely set these… it will just make life easier.
- Traefik is doing compression of web traffic instead of making paperless do it (this is recommended in the documentation). That means turning the default compression by paperless off in the environment variable PAPERLESS_ENABLE_COMPRESSION: false.
- Note you can set a limit on how much memory ImageMagick is allowed to consume in service of the OCR function, e.g. PAPERLESS_CONVERT_MEMORY_LIMIT: 100MB. There are issues with this setting: a value that is too high will be ignored, and you’d have to set it in a policy.xml as described here. Managing this limit is particularly important if you want to OCR 100 pages in one go.
- Once you’ve run the setup for the superuser, note that the db, tika, gotenberg and redis containers are all running. If you want to unravel your superuser setup (say, for example, you forget the password while you’re doing all this) you need to stop these containers BEFORE getting rid of the data (docker stop <container name>). Once you get rid of the data you will be able to restart from scratch.
- Note, this article assumes you’ve already got Traefik up and running, have it set up to manage SSL certificates through a resolver configured in the Traefik container, and understand Traefik entry points and networks in Docker.
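Two of the notes above describe settings that don’t actually appear in the compose file, so here is a minimal sketch of how they might be added to the webserver service. Treat it as an assumption-laden fragment rather than a tested config: as far as I understand Traefik, a middleware defined by a label only takes effect once a router references it, so the first label attaches compresspaperless to the paperless router. The 100MB value is just the example from the note, and quoting "false" is a precaution so the YAML parser hands Compose a string rather than a boolean.

```yaml
# Sketch only: merge these keys into the existing webserver service.
services:
  webserver:
    labels:
      # Assumption: attach the compress middleware to the router so Traefik
      # actually applies it (defining the middleware alone is not enough).
      - "traefik.http.routers.paperless.middlewares=compresspaperless"
    environment:
      # Quoted so the value reaches the container as the string "false".
      PAPERLESS_ENABLE_COMPRESSION: "false"
      # Example value from the note above; tune it for your host.
      PAPERLESS_CONVERT_MEMORY_LIMIT: 100MB
```

After changing labels or environment values the webserver container has to be recreated (docker compose up -d does this) before the new settings take effect.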
So I mentioned that Paperless-ngx doesn’t do one thing I want. That’s reporting or export of metadata. I’ve added custom fields to capture my receipt data and I want to get that out of the system. There is an API for pulling all of the documents and their metadata down, but I think there’s a better way and I’ve set up a local version of pgAdmin to explore that possibility. I think I should be able to build a SQL query to pull the data I need for my receipts for taxes… stay tuned for that posting…