Configuring and Starting Hadoop, Stratosphere and Spark Servers for bioKepler or DDP
This page describes how to separately configure and run the Hadoop, Stratosphere, and Spark servers (distributed mode) included with the bioKepler and DDP suites.
Hadoop
Hadoop module 1.1
Hadoop module 1.1 is based on Hadoop 2.2.0. The bioKepler and DDP suites include the binaries, libraries, and configuration files necessary to run a Hadoop server. These files are located in $HOME/KeplerData/workflows/module/hadoop-1.1.0/tools.
The following steps describe how to configure and start the Hadoop server included in the bioKepler/DDP suites. If Hadoop does not start, look at the log files in logs/.
Linux and Mac
- Set JAVA_HOME in etc/hadoop/hadoop-env.sh to where Java is installed on your computer.
- Make sure all files in bin/, sbin/, and etc/ are executable:
chmod a+x bin/* sbin/* etc/*
- Before starting Hadoop for the first time, format the namenode by running:
bin/format-namenode.sh
- Start Hadoop:
bin/start-hadoop.sh
Under Mac OS X, if start-hadoop.sh fails with "localhost: ssh: connect to host localhost port 22: Connection refused", go to System Preferences, select Sharing, and enable Remote Login.
- Stop Hadoop:
bin/stop-hadoop.sh
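The "format only before the first start" rule above can be sketched as a small guard. This is illustrative only: `format_namenode_once` and the `.namenode-formatted` marker file are hypothetical names, not part of the bundled scripts.

```shell
# Hypothetical helper: run bin/format-namenode.sh only on the first call,
# using a marker file in the tools directory to remember that it ran.
format_namenode_once() {
  tools_dir="$1"
  if [ ! -f "$tools_dir/.namenode-formatted" ]; then
    "$tools_dir/bin/format-namenode.sh" && touch "$tools_dir/.namenode-formatted"
  fi
}
```

For example, `format_namenode_once $HOME/KeplerData/workflows/module/hadoop-1.1.0/tools` could then safely precede every `bin/start-hadoop.sh`.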
Windows
- Download and install Cygwin, including the OpenSSH and OpenSSL packages, in "C:\cygwin", and add the following to your PATH environment variable:
c:\cygwin\bin;c:\cygwin\usr\bin
- Hadoop requires password-less SSH access to manage its nodes. Set up authorization keys to be used by Hadoop when ssh'ing to localhost:
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
- Set the JAVA_HOME parameter in conf/hadoop-env.sh (line 8) to where Java is installed on your computer.
- Set the Cygwin path translation in bin/hadoop-config.sh (line 184):
# cygwin path translation
if $cygwin; then
  JAVA_HOME=`cygpath -w "$JAVA_HOME"`
  CLASSPATH=`cygpath -wp "$CLASSPATH"`
  HADOOP_HOME=`cygpath -w "$HADOOP_HOME"`
  HADOOP_LOG_DIR=`cygpath -w "$HADOOP_LOG_DIR"`
  JAVA_LIBRARY_PATH=`cygpath -w "$JAVA_LIBRARY_PATH"`
  TOOL_PATH=`cygpath -wp "$TOOL_PATH"`
fi
- Before starting Hadoop for the first time, format the namenode by running:
./bin/format-namenode.sh
- Start Hadoop:
bin/start-hadoop.sh
- Stop Hadoop:
bin/stop-hadoop.sh
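For readers without a Cygwin shell at hand, the effect of the `cygpath -w` calls above can be illustrated with a tiny stand-in function. This is a sketch only; real installs should use cygpath itself.

```shell
# Mimics `cygpath -w` for a /cygdrive path: POSIX path in, Windows path out.
to_windows_path() {
  echo "$1" | sed -e 's|^/cygdrive/\(.\)/|\1:/|' -e 's|/|\\|g'
}

to_windows_path /cygdrive/c/java/jdk1.7.0   # -> c:\java\jdk1.7.0
```

This is why the translation matters: the Java process launched by the scripts expects Windows-style paths, while the scripts themselves run under a POSIX shell.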
Hadoop module 1.0
The bioKepler and DDP suites include the binaries, libraries, and configuration files necessary to run a Hadoop server. These files are located in $HOME/KeplerData/workflows/module/hadoop-1.0.0/tools.
The following steps describe how to configure and start the Hadoop server included in the bioKepler/DDP suites. If Hadoop does not start, look at the log files in logs/.
Linux and Mac
- Set JAVA_HOME in conf/hadoop-env.sh to where Java is installed on your computer.
- Make sure all files in bin/ and conf/hadoop-env.sh are executable.
- Before starting Hadoop for the first time, format the namenode by running:
echo "Y" | bin/hadoop namenode -format
- Start Hadoop:
bin/start-hadoop.sh
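The `echo "Y" |` prefix above exists because `hadoop namenode -format` asks for a Y/N confirmation when a filesystem directory already exists; piping a canned answer makes the step non-interactive. A minimal illustration of the pattern, where `fake_format` is a hypothetical stand-in for the prompting command:

```shell
# Stand-in for a command that reads a Y/N confirmation from stdin.
fake_format() {
  read answer
  [ "$answer" = "Y" ] && echo "formatted"
}

echo "Y" | fake_format   # -> formatted
```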
Windows
- Download and install Cygwin, including the OpenSSH and OpenSSL packages, in "C:\cygwin", and add the following to your PATH environment variable:
c:\cygwin\bin;c:\cygwin\usr\bin
- Hadoop requires password-less SSH access to manage its nodes. Set up authorization keys to be used by Hadoop when ssh'ing to localhost:
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
- Set the JAVA_HOME parameter in conf/hadoop-env.sh (line 8) to where Java is installed on your computer.
- Set the Cygwin path translation in bin/hadoop-config.sh (line 184):
# cygwin path translation
if $cygwin; then
  JAVA_HOME=`cygpath -w "$JAVA_HOME"`
  CLASSPATH=`cygpath -wp "$CLASSPATH"`
  HADOOP_HOME=`cygpath -w "$HADOOP_HOME"`
  HADOOP_LOG_DIR=`cygpath -w "$HADOOP_LOG_DIR"`
  JAVA_LIBRARY_PATH=`cygpath -w "$JAVA_LIBRARY_PATH"`
  TOOL_PATH=`cygpath -wp "$TOOL_PATH"`
fi
- Before starting Hadoop for the first time, format the namenode by running:
./bin/hadoop namenode -format
- Start Hadoop:
bin/start-hadoop.sh
Stratosphere
The bioKepler and DDP suites include the binaries, libraries, and configuration files necessary to run a Stratosphere server. These files are located in $HOME/KeplerData/workflows/module/stratosphere-1.2.0/tools.
The following steps describe how to configure and start the Stratosphere server included in the bioKepler/DDP suites. If Stratosphere does not start, look at the log files in logs/.
Linux and Mac
- Make sure all files in bin/ are executable.
- Start Stratosphere:
bin/start-local.sh
Windows
- Set JAVA_HOME in bin/nephele-config.sh (line 52) to where Java is installed on your computer.
- Set NEPHELE_JM_CLASSPATH, NEPHELE_CONF_DIR, and log_setting in bin/nephele-jobmanager.sh (lines 117 and 124):
NEPHELE_JM_CLASSPATH=`cygpath -wp $NEPHELE_JM_CLASSPATH`
log_setting="-Dlog.file="$log" -Dlog4j.configuration="$NEPHELE_CONF_DIR"/log4j.properties"
- Start Stratosphere:
bin/start-local.sh
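The log_setting line above simply assembles two JVM system properties from a log file path and a configuration directory. As a sanity check on what the finished string should look like, here is a small stand-in; `build_log_setting` is illustrative, not part of the Stratosphere scripts:

```shell
# Builds the logging-related JVM options that the job manager script passes.
build_log_setting() {
  log_file="$1"; conf_dir="$2"
  echo "-Dlog.file=$log_file -Dlog4j.configuration=$conf_dir/log4j.properties"
}

build_log_setting /tmp/nephele-jobmanager.log /opt/stratosphere/conf
# -> -Dlog.file=/tmp/nephele-jobmanager.log -Dlog4j.configuration=/opt/stratosphere/conf/log4j.properties
```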
Spark
Linux and Mac
The assembly JAR necessary to start the Spark Master is not included with Kepler since it is over 100 MB. To build this JAR, use the following steps:
- Download the source code for Spark 1.1.0.
- Extract the source:
tar xzpf spark-1.1.0.tgz
or
tar -xvzf spark-1.1.0.tgz
- Build the assembly JAR:
cd spark-1.1.0
SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly
Note: If you see errors like "[error] Server access Error: Too many open files url" when running the command, it is probably due to network problems while downloading dependencies. Try re-running the command.
- Copy the assembly JAR to KeplerData:
mkdir -p $HOME/KeplerData/workflows/module/spark/tools/assembly/target/scala-2.10
cp assembly/target/scala-2.10/spark-assembly-1.1.0-hadoop2.2.0.jar $HOME/KeplerData/workflows/module/spark/tools/assembly/target/scala-2.10/
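The copy step above silently produces an empty target directory if the build failed. A small guard can make the failure explicit; `stage_assembly_jar` is a hypothetical helper, not part of the suite:

```shell
# Copies a built Spark assembly JAR into a target directory, creating the
# directory first and refusing to proceed if the JAR is missing.
stage_assembly_jar() {
  src="$1"; dest_dir="$2"
  if [ ! -f "$src" ]; then
    echo "assembly JAR not found: $src (re-run sbt/sbt assembly)" >&2
    return 1
  fi
  mkdir -p "$dest_dir" && cp "$src" "$dest_dir/"
}
```

For example: `stage_assembly_jar assembly/target/scala-2.10/spark-assembly-1.1.0-hadoop2.2.0.jar $HOME/KeplerData/workflows/module/spark/tools/assembly/target/scala-2.10`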