Debugging from the Field: SmartSense Activity Explorer Stops Working
What Is SmartSense Activity Explorer?
SmartSense is a tool that has shipped with the Hortonworks Data Platform (HDP) since Ambari 2.2. On the surface, SmartSense takes data gathered by Ambari Metrics (the system behind the dashboards in Ambari and the more detailed Grafana dashboards), anonymizes it, and sends it to Hortonworks for analysis. Hortonworks then generates recommendations that help you better maintain your cluster. This alone makes it a tool I recommend to any HDP or HDF customer.
Activity Explorer came along later, but it is quickly becoming one of my favorite tools for managing real clusters. It has various dashboards that query the data from Ambari Metrics and show you nice graphs, such as the one below, which shows the number of files per user in HDFS for the top 15 users, ranked by their average number of files over the last 60 days.
Technically speaking, Activity Explorer is just a separate Zeppelin instance with a Phoenix interpreter attached. The Phoenix interpreter points to the Ambari Metrics System (AMS) instance of HBase, which stores the data we want to analyze. It comes out of the box with four dashboards: HDFS Dashboard, Tez and MapReduce Dashboard, YARN Dashboard, and Chargeback Dashboard. You can modify these dashboards, or build new ones, just as you would in a normal Zeppelin environment.
Now that we have a brief overview of how the system is set up, let's get into an issue we ran into while using the Activity Explorer dashboards.
Issues between Zeppelin and Phoenix
While I was working at a client, we came to a point where one of the dashboards in the Activity Explorer would have helped a lot. This client didn't use the Activity Explorer much, but knew about it and wanted to start using it more. So I logged into the Activity Explorer and found that all of the data was months out of date, even though the notebooks were scheduled to run multiple times a day. I tried to run one of the paragraphs and immediately got back the following error.
This error is very vague, but it immediately made me wonder whether the Phoenix servers were working properly. Ambari reported that the servers were up, but just to check, I logged into the node where the Activity Explorer was running and tried to run Phoenix queries against the same data. Remember from the background above that this is a separate AMS HBase instance, not the default one on your cluster, so just using psql from phoenix-client won't work. Instead, I needed to use the HBase client configuration in /etc/smartsense-activity/conf, which points to the AMS HBase instance. To do this, I ran something similar to the following.
When I tried to run with this configuration, I got an error caused by a permissions issue. It turned out that hbase-site.xml in that folder referenced a keytab I didn't have access to. To work around this, I copied the configuration folder to my home directory and changed the keytab and principal settings to point to a keytab and principal that I could access and that also had access to the AMS HBase instance. Trying again, I ran the following commands.
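As an aside, the copy-and-edit step described above can be sketched in Python. This is only a minimal illustration, not the exact procedure I used: it assumes the copied folder contains an hbase-site.xml in the usual Hadoop property format, and the property names you pass in are up to you, since the keytab and principal property names depend on what your file actually contains.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def copy_conf_with_overrides(src_dir, dest_dir, overrides):
    """Copy a Hadoop-style config folder and override hbase-site.xml values.

    `overrides` maps property names to new values. The names are whatever
    your hbase-site.xml uses for its keytab and principal settings; this
    sketch does not assume any particular ones.
    Returns the set of property names that were found and changed.
    """
    src, dest = Path(src_dir), Path(dest_dir)
    # Work on a private copy so the service's own configuration is untouched.
    shutil.copytree(src, dest)

    site_file = str(dest / "hbase-site.xml")
    tree = ET.parse(site_file)
    changed = set()
    for prop in tree.getroot().iter("property"):
        name = prop.findtext("name")
        value = prop.find("value")
        if name in overrides and value is not None:
            value.text = overrides[name]
            changed.add(name)

    # Note: ElementTree drops XML comments when it rewrites the file.
    tree.write(site_file, encoding="utf-8", xml_declaration=True)
    return changed
```

Checking the returned set against the names you passed in is a quick way to catch a misspelled property name before you try to connect with the copied configuration.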
This time it worked and opened up a JDBC connection terminal. I tried a few basic queries and was able to see the tables and query them successfully, so Phoenix itself was clearly working properly.
So now the question is, where is the issue?
With this information, our next step was to look at the logs for the Activity Explorer. I ran a tail -f command on all files within the /var/log/smartsense-activity folder, and got the following output.
As you can see from the log, it contains the error we saw earlier, but nothing more that could help us. I also noted that there was no detailed error message of the kind you would normally see. So that was a dead end as well.
We then tried restarting the interpreter and restarting the service as a whole, neither of which seemed to have an impact. We did notice, however, that the first run after a restart took some time before it failed, which gave us a clue.
Finally, A Solution
During one of these restarts, we looked at the interpreter log and found the following.
Now we are getting somewhere! The key is at the bottom of the call stack, where it mentions a class it can't find. This class belongs to WANDisco Fusion, a third-party tool the client was using to replicate data between the cluster and the cloud. It is similar in functionality to Apache Falcon or its successor from Hortonworks, Hortonworks DataPlane.
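Reading to the bottom of a long Java call stack by eye gets tedious, so here is a small Python sketch that pulls out the last "Caused by:" line from a chunk of log text and, when it is a class-loading error, the name of the missing class. It assumes the standard Java stack trace layout and is not tied to any Zeppelin log format. The class name used in the usage example is a made-up placeholder, not the real WANDisco class.

```python
import re

# Matches lines such as "Caused by: java.io.IOException: some message".
CAUSED_BY = re.compile(r"^\s*Caused by:\s*(?P<exc>[\w.$]+)(?::\s*(?P<msg>.*))?$")


def root_cause(log_text):
    """Return (exception_class, message) for the last 'Caused by:' line, or None."""
    last = None
    for line in log_text.splitlines():
        match = CAUSED_BY.match(line)
        if match:
            last = (match.group("exc"), (match.group("msg") or "").strip())
    return last


def missing_class(log_text):
    """Return the class named by a root-cause class-loading error, or None."""
    cause = root_cause(log_text)
    if cause is None:
        return None
    exc, msg = cause
    if not exc.endswith(("ClassNotFoundException", "NoClassDefFoundError")):
        return None
    # Messages look like "com.foo.Bar", "com/foo/Bar", or "Class com.foo.Bar not found".
    name = re.search(r"(?:Class\s+)?([\w.$/]+)", msg)
    return name.group(1).replace("/", ".") if name else None
```

For example, with a hypothetical trace:

```python
trace = """java.lang.RuntimeException: interpreter failed to open
\tat a.b.C.run(C.java:1)
Caused by: java.lang.ClassNotFoundException: Class com.example.fs.ReplicatedFs not found
\tat a.b.D.load(D.java:2)
"""
print(missing_class(trace))  # com.example.fs.ReplicatedFs
```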
In fact, this issue had been seen in a few other places before, including when we tried to start spark-shell on some edge nodes. Because of that, we knew pretty quickly what needed to be done. The issue was that the JAR containing this class wasn't available in the classpath for the activity explorer instance. Once we added this JAR to the classpath, everything started to work, and we were able to use the Activity Explorer properly.
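If you don't already know which JAR provides a missing class, a short Python sketch like the one below can search a directory tree for it. It relies only on the fact that a JAR is a zip archive and that a class named a.b.C is stored as a/b/C.class. The directory and class name you pass in are your own; nothing here is specific to WANDisco Fusion or to how the Activity Explorer builds its classpath.

```python
import zipfile
from pathlib import Path


def jars_containing(class_name, search_dir):
    """Return the JAR files under search_dir that contain the given class.

    class_name is a fully qualified Java class name, e.g. "com.example.Foo".
    """
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar in sorted(Path(search_dir).rglob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as archive:
                if entry in archive.namelist():
                    hits.append(jar)
        except (zipfile.BadZipFile, OSError):
            # Skip unreadable or corrupt archives rather than aborting the search.
            continue
    return hits
```

Once you have the path to the JAR, how you add it to the classpath depends on how the failing service is launched, so check that service's own configuration.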
With that fixed, I was finally able to do what I needed with the dashboards in the Activity Explorer. The important takeaway is that Zeppelin issues often surface when the interpreter first starts. Because the interpreter is a long-running process, an error during start-up is not retried; later runs just return a bland error. So if the logs from a failing run don't tell you much, restart the interpreter and look at the logs from when it is first used again. With Zeppelin, this is often the key to finding the log message that points you in the right direction.