Trouble-Shooting Celery & RabbitMQ For Open edX

Problems with Celery? Learn how to diagnose and correct configuration errors with Celery and RabbitMQ in your Open edX instance. You’ll get your Open edX instance back online in around 30 minutes or less with this how-to guide.

Background

A common pattern that you’ll see in Python Django projects like Open edX is Celery + RabbitMQ + Redis. This trio of open source technologies provides a robust and scalable means for applications to communicate asynchronously with other back-end resources. The results are impressive: your application can interact with remote email systems, grader programs, MySQL, MongoDB and the file system on your Ubuntu server in a way that prevents the front-end from freezing while waiting for responses. And because queued tasks are held by the message broker until a worker picks them up, the platform is also more resilient when a back-end resource is temporarily unavailable.

If you’re certain that Celery is not working correctly then you can skip to Trouble-Shooting Celery.

Celery is a distributed task queue written in Python, and is a common complement to Django applications. The execution units, called tasks, are executed concurrently on one or more worker servers. Tasks can execute asynchronously (in the background) or synchronously (the caller waits for the result). RabbitMQ, meanwhile, is a popular open source message broker. RabbitMQ is lightweight and easy to deploy on premises and in the cloud. It supports multiple messaging protocols, including the Advanced Message Queuing Protocol (AMQP) used by Open edX. RabbitMQ can be deployed in distributed and federated configurations to meet high-scale, high-availability requirements. It runs on many operating systems and cloud environments, and provides a wide range of developer tools for most popular languages.

Mahdi Yusuf created this great screencast that demonstrates how Celery + RabbitMQ + Redis work together in a Django app to generate an email during a new user signup operation. This is especially relevant since Open edX performs these exact operations in its new user Registration screen.

Your Open edX instance relies extensively on Celery + RabbitMQ for a host of common application operations:

  • New user registration
  • Drag & Drop UI functionality
  • Uploading documents
  • Sending email to users
  • Grading individual course exercise problems

If you experience unusual behavior from any of these functions then the culprit is often a configuration problem with Celery.

1. Diagnosing Problems With Celery / RabbitMQ

Celery and RabbitMQ are both highly stable subsystems that generally work reliably without any administrative oversight whatsoever. If I encounter problems with either subsystem it is almost always following a software upgrade, a database restore, or a server migration. Furthermore, the culprit is almost always Celery. Following are some common symptoms of a configuration problem with Celery in an Open edX platform.

Application Operation | Symptom
New user registration | The new user registration screen appears to die, and becomes unresponsive after clicking the signup command button. The new user data is never saved into the system and the new user never receives an activation email.
Sending email | The screen appears to freeze or die after you click the password reset button. New users do not receive their new user activation email.
Drag & Drop | The drag & drop function appears to work, however the changed value is not recognized. Additionally you cannot save results.
Remote grading program | The screen appears to die after submitting a response to an exercise or quiz problem.
Uploading document | The screen appears to die, and the document is never uploaded. The system provides neither a success nor a failure message.

If you’re experiencing any of these symptoms then you’ll next want to review the Open edX application logs for both the LMS and CMS to look for errors.

tail -n 50 /edx/var/log/lms/edx.log
tail -n 50 /edx/var/log/cms/edx.log

In particular, Celery often presents some challenges after migrations, upgrades and database restore operations. If Celery is not functioning correctly then you’ll see lots of errors in the LMS log of the following form:

Jul 25 20:52:01 ip-172-31-45-151 [service_variant=lms][celery.worker.consumer][env:sandbox] ERROR [ip-172-31-45-151 2089] [consumer.py:366] - consumer: Cannot connect to amqp://celery@127.0.0.1:5672//: [Errno 104] Connection reset by peer. Trying again in 2.00 seconds...
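
To gauge how widespread the problem is, it can help to count these broker connection errors rather than scroll through the log. The following is a minimal sketch, assuming the log paths shown above and that each error contains the text "Cannot connect to amqp" as in the sample line. The function name count_amqp_errors is only illustrative.

```shell
#!/usr/bin/env bash
# Count Celery broker connection failures in a log file.
# count_amqp_errors is a hypothetical helper name, not part of Open edX.
count_amqp_errors() {
    # grep -c prints the number of matching lines; "|| true" keeps a
    # zero-match result from being treated as a failure.
    grep -c -- 'Cannot connect to amqp' "$1" || true
}

# Report the count for each log that exists and is readable on this server.
for log in /edx/var/log/lms/edx.log /edx/var/log/cms/edx.log; do
    if [ -r "$log" ]; then
        echo "$log: $(count_amqp_errors "$log") connection errors"
    fi
done
```

A count that keeps growing while the platform is idle suggests the workers are stuck in the reconnect loop shown in the sample line.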

2. Trouble-Shooting Celery

RabbitMQ (and Celery) was installed by Ansible when you performed your native build. While there are many steps to installing RabbitMQ, the configuration itself is relatively simple and therefore easy to trouble-shoot, since there is only a small set of configuration values to check. The configuration consists of the following:

  • Two configuration values located in /etc/rabbitmq/rabbitmq-env.conf
  • Three RabbitMQ usernames (celery, edx and admin) with passwords and assigned permissions
  • One virtual host
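
The users, permissions and virtual host in the list above can be read straight from the broker. This is only a sketch, assuming rabbitmqctl is on the PATH and that the virtual host is "/" (the one used by the set_permissions command in Tip II). rabbitmq_inventory is a made-up helper name.

```shell
#!/usr/bin/env bash
# Print the RabbitMQ users, virtual hosts and permissions in one pass.
# rabbitmq_inventory is a hypothetical helper name; the vhost "/" is assumed.
rabbitmq_inventory() {
    if ! command -v rabbitmqctl >/dev/null 2>&1; then
        echo "rabbitmqctl not found; is RabbitMQ installed?" >&2
        return 1
    fi
    echo "== users =="
    sudo rabbitmqctl list_users
    echo "== vhosts =="
    sudo rabbitmqctl list_vhosts
    echo "== permissions on / =="
    sudo rabbitmqctl list_permissions -p /
}
```

Running rabbitmq_inventory on the server and comparing the output with the list above (three users, one virtual host) is a quick way to spot a missing user or a missing permission entry.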

You can attempt any combination of the following trouble-shooting methods, testing your results after each adjustment by attempting any operation in your LMS such as providing a response to any problem, or by requesting a password reset email.

Celery Trouble-Shooting Tip I: Verify the IP address in /etc/rabbitmq/rabbitmq-env.conf

The correct internal IP address for RabbitMQ is 127.0.0.1. However, sometimes Ansible will incorrectly populate this value with the server’s actual internal IP address, for example 172.16.102.101. I often encounter this problem when I reinstall RabbitMQ during platform upgrades.

sudo vim /etc/rabbitmq/rabbitmq-env.conf

Edit this file if necessary, and then restart the RabbitMQ service.
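
Before opening the file in an editor, a quick scan can show whether it contains any address other than 127.0.0.1. The sketch below makes no assumption about the variable names used in the file; it simply prints every line that contains something shaped like an IPv4 address and does not mention the loopback address. The helper name non_loopback_lines is illustrative.

```shell
#!/usr/bin/env bash
# Print lines of a config file that contain an IPv4-style address
# but do not mention 127.0.0.1.
# non_loopback_lines is a hypothetical helper name.
non_loopback_lines() {
    # First grep keeps lines with something shaped like an IPv4 address,
    # second grep drops the lines that mention the loopback address.
    grep -E '([0-9]{1,3}\.){3}[0-9]{1,3}' "$1" | grep -v '127\.0\.0\.1' || true
}

conf=/etc/rabbitmq/rabbitmq-env.conf
if [ -r "$conf" ]; then
    non_loopback_lines "$conf"
fi
```

No output means every address in the file is already 127.0.0.1; any line that is printed is a candidate for correction.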

Celery Trouble-Shooting Tip II: Set permissions of all Celery users

The following code block relaxes permissions for the username “celery”. This is roughly analogous to setting the permissions of a Linux file to “777”. If the source of your Celery problem is permissions then this will eliminate the problem. Afterwards, however, you should seek more information on the ramifications of relaxing Celery permissions in Open edX (sorry, but I’m no expert).

sudo rabbitmqctl set_permissions -p / celery ".*" ".*" ".*"
sudo service rabbitmq-server restart
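
To confirm that the change took effect, the permissions can be read back and checked. A minimal sketch, assuming the tab-separated user / configure / write / read columns that rabbitmqctl list_permissions prints. has_full_permissions is a hypothetical helper, not a RabbitMQ command.

```shell
#!/usr/bin/env bash
# Succeed if the given user has ".*" for configure, write and read.
# Reads the output of "rabbitmqctl list_permissions" on standard input.
# has_full_permissions is a hypothetical helper name.
has_full_permissions() {
    awk -F '\t' -v user="$1" '
        $1 == user && $2 == ".*" && $3 == ".*" && $4 == ".*" { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Usage on the server (not run here):
#   sudo rabbitmqctl list_permissions -p / | has_full_permissions celery \
#       && echo "celery: full access on /"
```

Because the helper only reads standard input, the same check can be applied to saved output from another server.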

Celery Trouble-Shooting Tip III: Reset Celery user passwords

If you followed my guidelines for a Native Build on Ubuntu 16.04 LTS then you (hopefully) have a file named my-passwords.yml located in /home/ubuntu. Per the illustration below, the passwords for the three Celery users are located at the bottom of this file. Note that in each case the value of the password is referenced from elsewhere in the same document. I’ve attempted to illustrate how this referencing scheme works by highlighting the appropriate row in the file for the “Admin” user’s password.

sudo rabbitmqctl change_password celery YourPasswordForTheCeleryUser
sudo rabbitmqctl change_password edx YourPasswordForTheEdxUser
sudo rabbitmqctl change_password admin YourPasswordForTheAdminUser
sudo service rabbitmq-server restart
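
Once the passwords are reset, you can check that the broker accepts them before restarting anything else. The sketch below assumes a RabbitMQ release that includes the rabbitmqctl authenticate_user command (older releases may not have it). check_rabbitmq_login is a hypothetical wrapper, and the password in the usage line is the same placeholder used above.

```shell
#!/usr/bin/env bash
# Ask the broker whether a username/password pair is valid.
# check_rabbitmq_login is a hypothetical wrapper around
# "rabbitmqctl authenticate_user"; availability of that command
# on your RabbitMQ version is an assumption.
check_rabbitmq_login() {
    if sudo rabbitmqctl authenticate_user "$1" "$2" >/dev/null 2>&1; then
        echo "$1: password accepted"
    else
        echo "$1: password rejected (or authenticate_user unavailable)"
        return 1
    fi
}

# Usage on the server (not run here):
#   check_rabbitmq_login celery YourPasswordForTheCeleryUser
```

Keep in mind that passing a password on the command line leaves it in your shell history.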

Celery Trouble-Shooting Tip IV: Re-install Celery

Some combination of the previous trouble-shooting methods will very likely solve your problem. But if you’re still having problems then you can completely reinstall RabbitMQ by calling the appropriate Ansible playbook, as follows:

sudo bash
# Activate the Ansible virtual environment (it must be sourced, using the absolute path)
source /edx/app/edx_ansible/venvs/edx_ansible/bin/activate
cd /edx/app/edx_ansible/edx_ansible/playbooks/
ansible-playbook -c local -i 'localhost,' ./run_role.yml -e "role=rabbitmq"
# Use this command instead if you are using a server-vars.yml file
# ansible-playbook -c local -i 'localhost,' ./run_role.yml -e "role=rabbitmq" -e@/edx/app/edx_ansible/server-vars.yml
exit
sudo service rabbitmq-server restart
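
After the restart, a quick way to confirm that the broker is listening again is to try opening the AMQP port named in the log error (127.0.0.1:5672). This is a sketch that assumes bash, since it relies on bash’s /dev/tcp redirection. port_open is an illustrative helper name.

```shell
#!/usr/bin/env bash
# Succeed if a TCP connection to host:port can be opened.
# Relies on bash's built-in /dev/tcp pseudo-device; port_open is a
# hypothetical helper name.
port_open() {
    # The subshell opens the socket on fd 3; the descriptor is closed
    # automatically when the subshell exits.
    (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

if port_open 127.0.0.1 5672; then
    echo "RabbitMQ is accepting connections on 127.0.0.1:5672"
else
    echo "Could not connect to 127.0.0.1:5672"
fi
```

A successful connection only shows that something is listening on the port; it does not prove that the Celery credentials or permissions are correct.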

One last thing: I found two threads from the Open edX Devops Google Group very helpful the first time I encountered problems with Celery.

3. Restart Platform

If you performed any of the curative actions in the section above then you should restart your Open edX platform. For most administrative tasks you only need to restart the LMS and CMS, but in this case it’s a good idea to restart everything.

sudo rm /edx/var/log/lms/edx.log        #delete the current active log (to simplify diagnostics in the next step)
sudo rm /edx/var/log/cms/edx.log        #delete the current active log (to simplify diagnostics in the next step)

# Option I: reboot the server
sudo reboot

# Option II: restart the Open edX services individually
sudo /edx/bin/supervisorctl restart lms
sudo /edx/bin/supervisorctl restart cms
sudo /edx/bin/supervisorctl restart edxapp_worker:
sudo /edx/bin/supervisorctl restart analytics_api
sudo /edx/bin/supervisorctl restart certs
sudo /edx/bin/supervisorctl restart discovery
sudo /edx/bin/supervisorctl restart ecommerce
sudo /edx/bin/supervisorctl restart ecomworker
sudo /edx/bin/supervisorctl restart forum
sudo /edx/bin/supervisorctl restart insights
sudo /edx/bin/supervisorctl restart notifier-celery-workers
sudo /edx/bin/supervisorctl restart notifier-scheduler
sudo /edx/bin/supervisorctl restart xqueue
sudo /edx/bin/supervisorctl restart xqueue_consumer
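
After restarting, it is worth confirming that every service actually came back up. The sketch below filters the output of supervisorctl status, assuming its usual layout of process name in the first column and state in the second. not_running is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print supervisor processes whose state is anything other than RUNNING.
# Reads "supervisorctl status" output on standard input.
# not_running is a hypothetical helper name.
not_running() {
    awk 'NF >= 2 && $2 != "RUNNING" { print $1 ": " $2 }'
}

# Usage on the server (not run here):
#   sudo /edx/bin/supervisorctl status | not_running
```

An empty result means every process is in the RUNNING state; anything listed (for example an edxapp_worker process in FATAL) is where to look next in the logs.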

I hope you found this helpful. Please help me improve this article by leaving a comment below. Thank you!

August 16th, 2018 | Categories: Dev Ops, Open edX

About the Author:

Lawrence is a full stack developer specializing in the Open edX platform, Django, Angular, Ionic, Wordpress and Amazon Web Services. He lives in Puerto Escondido, Oaxaca, Mexico.
