Thursday, October 9, 2014

Linux Fu


Awesome Quora answer to "What are some time-saving tips that every Linux user should know?" by Joshua Levy

Reproduced here for convenience.

Here is a selection of command-line tips that I've found useful when working on Linux. The emphasis is on somewhat less-known techniques that are generally important or useful to technical users. It's a bit long, and users certainly don't need to know all of them, but I've done my best to make sure each item is worth reading in terms of projected time savings, if you use Linux heavily.

To get more information on a command mentioned, first try "man <command name>". In some cases, you must install a package for this to work -- try aptitude or yum. If that fails, Google it.

Basics

  • Learn basic Bash. Actually, read the whole bash man page; it's pretty easy to follow and not that long. Alternate shells can be nice, but bash is powerful and always available (learning mainly zsh or tcsh restricts you in many situations).
  • Learn vim. There's really no competition for random Linux editing (even if you use Emacs or Eclipse most of the time).
  • Know ssh, and the basics of passwordless authentication, via ssh-agent, ssh-add, etc. (a minimal setup example follows this list).
  • Be familiar with bash job management: &, Ctrl-Z, Ctrl-C, jobs, fg, bg, kill, etc.
  • Basic file management: ls and ls -l (in particular, learn what every column in "ls -l" means), less, head, tail and tail -f, ln and ln -s (learn the differences and advantages of hard versus soft links), chown, chmod, du (for a quick summary of disk usage: du -sk *), df, mount.
  • Basic network management: ip or ifconfig, dig.
  • Know regular expressions well, and the various flags to grep/egrep. The -o, -A, and -B options are worth knowing.
  • Learn to use apt-get or yum (depending on distro) to find and install packages.
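
For the ssh-agent/ssh-add item above, a minimal passwordless-login setup looks roughly like this (user@somehost is just a placeholder):

ssh-keygen -t rsa          # generate a key pair; accept the defaults and set a passphrase
ssh-copy-id user@somehost  # append your public key to ~/.ssh/authorized_keys on the remote host
eval "$(ssh-agent)"        # start an agent for this session
ssh-add                    # cache the decrypted private key with the agent
ssh user@somehost          # now logs in without prompting for the key passphrase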

Everyday use
  • In bash, use Ctrl-R to search through command history.
  • In bash, use Ctrl-W to kill the last word, and Ctrl-U to kill the line. See man readline for the default keybindings in bash; there are a lot. For example, Alt-. cycles through previous arguments, and Alt-* expands a glob.
  • To go back to the previous working directory: cd -
  • If you are halfway through typing a command but change your mind, hit Alt-# to add a # at the beginning and enter it as a comment (or use Ctrl-A, #, enter). You can then return to it later via command history.
  • Use xargs (or parallel). It's very powerful. Note you can control how many items execute per line (-L) as well as parallelism (-P). If you're not sure if it'll do the right thing, use xargs echo first. Also, -I{} is handy. Examples:
find . -name \*.py | xargs grep some_function
cat hosts | xargs -I{} ssh root@{} hostname
  • pstree -p is a helpful display of the process tree.
  • Use pgrep and pkill to find or signal processes by name (-f is helpful).
  • Know the various signals you can send processes. For example, to suspend a process, use kill -STOP [pid]. For the full list, see man 7 signal.
  • Use nohup or disown if you want a background process to keep running forever.
  • Check what processes are listening via netstat -lntp. See also lsof.
  • In bash scripts, use set -x for debugging output. Use set -e to abort on errors. Consider using set -o pipefail as well, to be strict about errors (though this topic is a bit subtle). For more involved scripts, also use trap (a minimal skeleton appears after this list).
  • In bash scripts, subshells (written with parentheses) are convenient ways to group commands. A common example is to temporarily move to a different working directory, e.g.
# do something in current dir
(cd /some/other/dir; other-command)
# continue in original dir
  • In bash, note there are lots of kinds of variable expansion. Checking a variable exists: ${name:?error message}. For example, if a bash script requires a single argument, just write input_file=${1:?usage: $0 input_file}. Arithmetic expansion: i=$(( (i + 1) % 5 )). Sequences: {1..10}. Trimming of strings: ${var%suffix} and ${var#prefix}. For example if var=foo.pdf, then echo ${var%.pdf}.txt prints "foo.txt".
  • The output of a command can be treated like a file via <(some command). For example, compare local /etc/hosts with a remote one: diff /etc/hosts <(ssh somehost cat /etc/hosts)
  • Know about "here documents" in bash, as in cat <<EOF ....
  • In bash, redirect both standard output and standard error via: some-command >logfile 2>&1. Often, to ensure a command does not leave an open file handle to standard input, tying it to the terminal you are in, it is also good practice to add "</dev/null".
  • Use man ascii for a good ASCII table, with hex and decimal values.
  • On remote ssh sessions, use screen or dtach to save your session, in case it is interrupted.
  • In ssh, knowing how to port tunnel with -L or -D (and occasionally -R) is useful, e.g. to access web sites from a remote server.
  • It can be useful to make a few optimizations to your ssh configuration; for example, this .ssh/config contains settings to avoid dropped connections in certain network environments, not require confirmation connecting to new hosts, forward authentication, and use compression (which is helpful with scp over low-bandwidth connections):
TCPKeepAlive=yes
ServerAliveInterval=15
ServerAliveCountMax=6
StrictHostKeyChecking=no
Compression=yes
ForwardAgent=yes
  • To get the permissions on a file in octal form, which is useful for system configuration but not shown by "ls" and easy to get wrong, use something like
stat -c '%A %a %n' /etc/timezone
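
As a rough sketch of the set -e / set -x / pipefail / trap advice above (the script body and file names are purely illustrative):

#!/usr/bin/env bash
set -euo pipefail             # abort on errors, unset variables, and failures inside pipelines
# set -x                      # uncomment for line-by-line debugging output

tmpfile=$(mktemp)
trap 'rm -f "$tmpfile"' EXIT  # remove the temp file however the script exits

input_file=${1:?usage: $0 input_file}   # fail with a usage message if no argument is given
sort "$input_file" > "$tmpfile"
mv "$tmpfile" "${input_file}.sorted"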

Data processing
  • To convert HTML to text: lynx -dump -stdin
  • If you must handle XML, xmlstarlet is good.
  • For Amazon S3, s3cmd is convenient (albeit immature, with occasional misfeatures).
  • Know about sort and uniq (including uniq's -u and -d options).
  • Know about cut, paste, and join to manipulate text files. Many people use cut but forget about join (see the small example after this list).
  • It is remarkably helpful sometimes that you can do set intersection, union, and difference of text files via sort/uniq. Suppose a and b are text files that are already uniqued. This is fast, and works on files of arbitrary size, up to many gigabytes. (Sort is not limited by memory, though you may need to use the -T option if /tmp is on a small root partition.)
cat a b | sort | uniq > c   # c is a union b
cat a b | sort | uniq -d > c   # c is a intersect b
cat a b b | sort | uniq -u > c   # c is set difference a - b
  • Know that locale affects a lot of command line tools, including sorting order and performance. Most Linux installations will set LANG or other locale variables to a local setting like US English. This can make sort or other commands run many times slower. (Note that even if you use UTF-8 text, you can safely sort by ASCII order for many purposes.) To disable slow i18n routines and use traditional byte-based sort order, use export LC_ALL=C (in fact, consider putting this in your .bashrc).
  • Know basic awk and sed for simple data munging. For example, summing all numbers in the third column of a text file: awk '{ x += $3 } END { print x }'. This is probably 3X faster and 3X shorter than equivalent Python.
  • To replace all occurrences of a string in place, in one or more files:
perl -pi.bak -e 's/old-string/new-string/g' my-files-*.txt
  • To rename many files at once according to a pattern, use rename. (Or if you want something more general, my own tool repren may help.)
rename 's/\.bak$//' *.bak
  • Use shuf to shuffle or select random lines from a file.
  • Know sort's options. Know how keys work (-t and -k). In particular, watch out that you need to write -k1,1 to sort by only the first field; -k1 means sort according to the whole line.
  • Stable sort (sort -s) can be useful. For example, to sort first by field 2, then secondarily by field 1, you can use sort -k1,1 | sort -s -k2,2
  • If you ever need to write a tab literal in a command line in bash (e.g. for the -t argument to sort), press Ctrl-V <tab> or write $'\t' (the latter is better as you can copy/paste it).
  • For binary files, use hd for simple hex dumps and bvi for binary editing.
  • Also for binary files, strings (plus grep, etc.) lets you find bits of text.
  • To convert text encodings, try iconv. Or uconv for more advanced use; it supports some advanced Unicode things. For example, this command lowercases and removes all accents (by expanding and dropping them):
uconv -f utf-8 -t utf-8 -x '::Any-Lower; ::Any-NFD; [:Nonspacing Mark:] >; ::Any-NFC; ' < input.txt > output.txt
  • To split files into pieces, see split (to split by size) and csplit (to split by a pattern).
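
As a small illustration of join from the cut/paste/join item above (file names and contents are hypothetical), joining two files on their first field:

# names.txt contains "1 alice" and "2 bob"; scores.txt contains "1 90" and "2 85"
join names.txt scores.txt                  # prints "1 alice 90" and "2 bob 85"
# join expects both inputs sorted on the join field; sort on the fly if needed:
join <(sort names.txt) <(sort scores.txt)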

System debugging
  • For web debugging, curl and curl -I are handy, and/or their wget equivalents.
  • To know disk/cpu/network status, use iostat, netstat, top (or the better htop), and (especially) dstat. Good for getting a quick idea of what's happening on a system.
  • To know memory status, run and understand the output of free and vmstat. In particular, be aware the "cached" value is memory held by the Linux kernel as file cache, so effectively counts toward the "free" value.
  • Java system debugging is a different kettle of fish, but a simple trick on Sun's and some other JVMs is that you can run kill -3 <pid> and a full stack trace and heap summary (including generational garbage collection details, which can be highly informative) will be dumped to stderr/logs.
  • Use mtr as a better traceroute, to identify network issues.
  • For looking at why a disk is full, ncdu saves time over the usual commands like "du -sk *".
  • To find which socket or process is using bandwidth, try iftop or nethogs.
  • The ab tool (comes with Apache) is helpful for quick-and-dirty checking of web server performance. For more complex load testing, try siege.
  • For more serious network debugging, wireshark or tshark.
  • Know strace and ltrace. These can be helpful if a program is failing, hanging, or crashing, and you don't know why, or if you want to get a general idea of performance. Note the profiling option (-c), and the ability to attach to a running process (-p). See the sketch after this list.
  • Know about ldd to check shared libraries etc.
  • Know how to connect to a running process with gdb and get its stack traces.
  • Use /proc. It's amazingly helpful sometimes when debugging live problems. Examples: /proc/cpuinfo, /proc/xxx/cwd, /proc/xxx/exe, /proc/xxx/fd/, /proc/xxx/smaps.
  • When debugging why something went wrong in the past, sar can be very helpful. It shows historic statistics on CPU, memory, network, etc.
  • For deeper systems and performance analyses, look at stap (systemtap) and perf.
  • Confirm what Linux distribution you're using (works on most distros): "lsb_release -a"
  • Use dmesg whenever something's acting really funny (it could be hardware or driver issues).
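
A rough sketch of the strace and gdb items above (1234 stands in for whatever pid you're investigating):

strace -c -p 1234                              # attach and count/time syscalls; Ctrl-C detaches and prints the summary
gdb -p 1234 -batch -ex 'thread apply all bt'   # attach, dump stack traces for all threads, then detach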

Wednesday, September 17, 2014

Excel Macro to apply data filter, autosize cols etc

I've been working with a lot of Hadoop output files lately and inspecting the top results with Excel.

Consequently, here is a useful Excel macro to prepare the data for browsing: it applies a data filter for sorting, bolds the headings, freezes the top row, and autosizes the columns.

It's also good to map it to Ctrl+Shift+D.


Sub ApplyDataFilter()
'
' Excel 2013 ApplyDataFilter Macro
' Applies data filter, bold headings, freeze top row, autosize columns.
'
    Rows("1:1").Select
    Selection.Font.Bold = True
    With ActiveWindow
        .SplitColumn = 0
        .SplitRow = 1
    End With
    ActiveWindow.FreezePanes = True
    Cells.Select
    Selection.AutoFilter
    Range("A2").Select
   
    Cells.Select
    Cells.EntireColumn.AutoFit
End Sub


Tuesday, July 22, 2014

Import CSV file into Microsoft SQL Server via SQL command

I find this method way easier than messing with the SQL Server Import/Export Wizard, although it can be useful to have the wizard create the target table definition first.

CREATE TABLE [dbo].[TargetTable] (
    [Id] [int] NULL,
    [StringCol] [varchar](MAX) NULL,
    [IntCol] [int] NULL
    -- additional columns here
) ON [PRIMARY]
GO


BULK INSERT [dbo].[TargetTable]
    FROM 'C:\pathto\SourceFile.csv'
    WITH
    (
        FIRSTROW = 2,             -- 2 if there are headers
        FIELDTERMINATOR = ',',    -- field delimiter
        ROWTERMINATOR = '\n',     -- next row
        ERRORFILE = 'C:\pathto\_ImportErrors.txt',
        TABLOCK
    )
GO



Monday, July 21, 2014

Background download of wikipedia dump on Linux

A handy way to download a Wikipedia dump in the background so that it continues after the terminal session ends:

$ nohup wget -bqc http://dumps.wikimedia.org/enwiki/20140707/enwiki-20140707-pages-articles-multistream.xml.bz2
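
One way to check on progress later is simply to watch the partial file grow (file name as in the command above):

watch -n 60 ls -lh enwiki-20140707-pages-articles-multistream.xml.bz2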

The latest wiki dump URL can be found here:
http://dumps.wikimedia.org/backup-index.html

References:
http://www.cyberciti.biz/tips/nohup-execute-commands-after-you-exit-from-a-shell-prompt.html
http://www.cyberciti.biz/tips/linux-wget-your-ultimate-command-line-downloader.html

Tuesday, July 15, 2014

Setting up an FTP server on CentOS 6.x

Need FTP access on a CentOS machine? Easy.

sudo yum install vsftpd
sudo service vsftpd start
sudo chkconfig vsftpd on
chkconfig --list

sudo vi /etc/vsftpd/vsftpd.conf # edit default conf file to have these lines:
    anonymous_enable=NO
    chroot_local_user=YES
    chroot_list_enable=NO
    chroot_list_file=/etc/vsftpd/chroot_list
    pasv_min_port=3000
    pasv_max_port=3050
   
sudo touch /etc/vsftpd/chroot_list
sudo /sbin/service vsftpd restart


Done.
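
If the server is running iptables, you may also need to open the FTP control port and the passive port range configured above before remote clients can connect (a sketch, assuming the default INPUT chain):

sudo iptables -I INPUT -p tcp --dport 21 -j ACCEPT
sudo iptables -I INPUT -p tcp --dport 3000:3050 -j ACCEPT
sudo service iptables save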

Note that these steps have been grabbed from this Rackspace article that includes much more detail:
http://www.rackspace.com/knowledge_center/article/rackspace-cloud-essentials-centos-installing-vsftpd

Tuesday, June 10, 2014

New Eclipse workspace settings


Each new Eclipse workspace starts with the default settings, argh! I'm going to collect my customisations here.


Window > Preferences...

General > Editors > Text Editors > Spelling | Turn off

General > Editors > Text Editors | Show line numbers

General > Keys | Next Editor bound to Ctrl-Tab; Prev Editor bound to Ctrl-Shift-Tab


Need to check out:
https://code.google.com/a/eclipselabs.org/p/workspacemechanic/
http://mcuoneclipse.com/2012/04/04/copy-my-workspace-settings/





Wednesday, March 12, 2014

Setting up Hive with HBase external storage on CDH

If you have existing HBase tables, it can be very handy to create Hive external tables wrapping them so that you can run HiveQL queries.

The following HiveQL will create the metastore table schema on top of MyTableName:

CREATE EXTERNAL TABLE MyTableName(key string, Column1 string, Column2 string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:Column1,cf:Column2")
TBLPROPERTIES ('hbase.table.name' = 'MyTableName');



The next step is to add the hbase auxlib jars to hive-site.xml to ensure certain HiveQL queries will run (e.g. select count(1) from MyTableName):
<property>
<name>hive.aux.jars.path</name>
<value>file:///opt/cloudera/parcels/CDH/lib/hive/lib/zookeeper.jar,file:///opt/cloudera/parcels/CDH/lib/hive/lib/hbase.jar,file:///opt/cloudera/parcels/CDH/lib/hive/lib/hive-hbase-handler-0.10.0-cdh4.6.0.jar,file:///opt/cloudera/parcels/CDH/lib/hive/lib/guava-11.0.2.jar</value>
</property>


This property must be added to the hive1 service > Config > Service-Wide > Advanced > Hive Service Configuration Safety Valve for hive-site.xml section to be able to execute certain HiveQL commands from the host command line.

To execute HiveQL from the Hue HiveUI add it to the hue1 service > Config > Beeswax Server (Default) > Advanced > Hive Configuration Safety Valve section.
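
Once the jars are in place, a quick way to sanity-check the mapping from the host command line (table name as in the example above) is:

hive -e "SELECT COUNT(1) FROM MyTableName;"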


Refs:
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
http://www.confusedcoders.com/bigdata/hive/hbase-hive-integration-querying-hbase-via-hive
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/cdh4ig_topic_18_10.html
https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/E8GfiwMOIPw

Sunday, March 9, 2014

Free disk space email alert on Windows Server

Schedule the following PowerShell script to run periodically to receive email alerts on low disk space:


# Schedule with: %windir%\SysWOW64\WindowsPowerShell\v1.0\powershell.exe -WindowStyle "Hidden" -File C:\path_to_scripts\freedisk_space_alert.ps1

$minGbThreshold = 10;

$computers = "localhost"
$smtpAddress = "localhost";
$toAddress = "someone@test.com";
$fromAddress = "someone@test.com";

foreach($computer in $computers)
{
    $disks = Get-WmiObject -ComputerName $computer -Class Win32_LogicalDisk -Filter "DriveType = 3";
    $computer = $computer.toupper();
    foreach($disk in $disks)
    {
        $freeSpaceGB = [Math]::Round([float]$disk.FreeSpace / 1073741824, 2);
        if($freeSpaceGB -lt $minGbThreshold)
        {
            $smtp = New-Object Net.Mail.SmtpClient($smtpAddress)
            $msg = New-Object Net.Mail.MailMessage
            $msg.To.Add($toAddress)
            $msg.From = $fromAddress
            $msg.Subject = "Diskspace below threshold: " + $computer + "\" + $disk.DeviceId
            $msg.Body = $computer + "\" + $disk.DeviceId + " " + $freeSpaceGB + "GB Remaining";
           
            $smtp.UseDefaultCredentials = $false;       
            $cred = New-Object System.Net.NetworkCredential("\smtpuser", "<enter password>");  # Ensure smtpuser is authorised to send emails
            $smtp.Credentials = $cred;

            $smtp.Send($msg)
        }
    }
}



This script is slightly modified from the one found here to include credentials:
http://gavindraper.com/2012/09/22/automatic-low-hard-disk-alerts-for-windows-server/
Thanks Gavin!

Also, you may need to enable script execution in PowerShell by running the following:
%windir%\SysWOW64\WindowsPowerShell\v1.0\powershell.exe set-executionpolicy remotesigned
This will still restrict Internet-downloaded scripts and is better than setting the policy to Unrestricted.
See: http://superuser.com/questions/106360/how-to-enable-execution-of-powershell-scripts

Thursday, February 27, 2014

Efficient uploading of Jar files to your Hadoop cluster

Copying fat Jar files up to your Hadoop cluster to execute jobs on production-sized data sets in order to find bottlenecks can be painful when you want a quick turnaround whilst debugging.

Sometimes local mode just doesn't cut it.

A good solution is to use rsync, which uses incremental checking to transfer only the differences between files.

Command:
rsync -avz /your_source_directory/somejob-0.0.1.jar login@servername:/target_directory/somejob-0.0.1.jar

Options:
-a archive mode
-v verbose mode
-z compress file data during the transfer
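
To preview what would be transferred before committing, a dry run can be added first (same hypothetical paths as above):

rsync -avzn /your_source_directory/somejob-0.0.1.jar login@servername:/target_directory/          # -n: dry run, lists what would change
rsync -avz --progress /your_source_directory/somejob-0.0.1.jar login@servername:/target_directory/ # real transfer with per-file progress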

Sunday, February 23, 2014

Solution to Maven solr-core artifact causes Eclipse error: "Missing artifact jdk.tools:jdk.tools:jar:1.6"


In an Eclipse Maven project, referencing the solr-core artifact (v4.x) can cause Eclipse to report the Maven dependency problem "Missing artifact jdk.tools:jdk.tools:jar:1.6".

This also causes a build path problem: The container 'Maven Dependencies' references non existing library 'C:\Users\<userdir>\.m2\repository\jdk\tools\jdk.tools\1.6\jdk.tools-1.6.jar'

The tools jar file is supplied by the JDK, so we can exclude it in the pom.xml by adding an <exclusions> section like so:

<dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solr-core</artifactId>
    <version>4.5.1</version>
    <exclusions>
        <exclusion>
            <artifactId>jdk.tools</artifactId>
            <groupId>jdk.tools</groupId>
        </exclusion>
    </exclusions>

</dependency>


Does anyone know if there is a better way to resolve this issue?

Tuesday, January 28, 2014

Export SQL Server table to csv file with headers using bcp

Ref: http://pastebin.com/x8Kk4Dn9
Ref: http://stackoverflow.com/questions/1355876/export-table-to-file-with-column-headers-column-names-using-the-bcp-utility-an/9754485#9754485


I use a method that outputs one file for the column headers read from INFORMATION_SCHEMA.COLUMNS and then appends a second file with the table data, both of which are generated using BCP.

Here is the batch file that creates TableData.csv; just replace the environment variables at the top.

Note that if you need to supply credentials, replace the -T option with -U my_username -P my_password

set BCP_EXPORT_SERVER=put_my_server_name_here
set BCP_EXPORT_DB=put_my_db_name_here
set BCP_EXPORT_TABLE=put_my_table_name_here

BCP "DECLARE @colnames VARCHAR(max);SELECT @colnames = COALESCE(@colnames + ',', '') + column_name from %BCP_EXPORT_DB%.INFORMATION_SCHEMA.COLUMNS where TABLE_NAME='%BCP_EXPORT_TABLE%'; select @colnames;" queryout HeadersOnly.csv -c -T -S%BCP_EXPORT_SERVER%

BCP %BCP_EXPORT_DB%.dbo.%BCP_EXPORT_TABLE% out TableDataWithoutHeaders.csv -c -t, -T -S%BCP_EXPORT_SERVER%

set BCP_EXPORT_SERVER=
set BCP_EXPORT_DB=
set BCP_EXPORT_TABLE=

copy /b HeadersOnly.csv+TableDataWithoutHeaders.csv TableData.csv

del HeadersOnly.csv
del TableDataWithoutHeaders.csv


This method has the advantage of always having the column names in sync with the table by using INFORMATION_SCHEMA.COLUMNS. The downside is it's a bit messy and creates temporary files. Microsoft should really fix the bcp utility to support this.

It uses the row-concatenation trick from "Concatenate many rows into a single text string?" combined with ideas from http://social.msdn.microsoft.com/forums/en-US/sqlgetstarted/thread/812b8eec-5b77-42a2-bd23-965558ece5b9/

Sunday, January 5, 2014

2 Ways to Count Rows in HBase

There are 2 ways to count the number of rows in an HBase table:

Run the following from the Linux command line:
$ hbase org.apache.hadoop.hbase.mapreduce.RowCounter MyTableName

Run the following from the hbase shell (which is accessible from the Hue shell):
> count 'MyTableName'