Tuesday 15 April 2014

OS X tip of the day: mounting an NTFS drive

So, a friend lends you a hard drive with a collection of movies, music, pics, etc. You plug it into your Mac and nothing happens: Disk Utility shows the disk, but you can't see it in Finder! Chances are you have been given a disk formatted with the NTFS file system. Damn! you say, but fret not. Here is a quick and easy way to get access to that NTFS-formatted drive.

1. Open up a terminal: press Command+Space to bring up Spotlight and type Terminal.
2. At the terminal, enter the following command: diskutil list

You will see output that looks similar to the following:

diskutil list
/dev/disk0
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:      GUID_partition_scheme                        *320.1 GB   disk0
   1:                        EFI EFI                     209.7 MB   disk0s1
   2:                  Apple_HFS Macintosh HD            319.2 GB   disk0s2
   3:                 Apple_Boot Recovery HD             650.0 MB   disk0s3
/dev/disk1
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:     FDisk_partition_scheme                        *3.0 TB     disk1
   1:               Windows_NTFS VIDEO2                  3.0 TB     disk1s1



The line of interest is the one with Windows_NTFS in the TYPE column. In particular, we are interested in its IDENTIFIER: disk1s1.

3. Create a directory that will serve as the mount point for your NTFS drive, e.g.: mkdir /Volumes/my_ntfs_disk

4. Now, using the identifier we discovered earlier, we can mount the NTFS drive like so: sudo mount -t ntfs /dev/disk1s1 /Volumes/my_ntfs_disk

5. Check Finder and you will now see the NTFS drive, mounted and ready to browse. One caveat: OS X's built-in NTFS driver mounts volumes read-only, so you can copy files off the drive but not write to it. For write access you would need a third-party driver such as NTFS-3G.
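
One last tip: when you are done with the drive, unmount it cleanly before unplugging it. A minimal sketch, assuming the same identifier and mount point used above:

# Unmount the drive cleanly before you unplug it
sudo umount /Volumes/my_ntfs_disk

# Alternatively, let diskutil do the unmounting for you
diskutil unmount /dev/disk1s1

# Remove the mount point directory we created in step 3
rmdir /Volumes/my_ntfs_disk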

Thursday 10 April 2014

A fundamental analysis of Hadoop

What is Hadoop?

In simple terms, Hadoop is a tool used for the storage and processing of large scale data sets on clusters of commodity hardware. It was created by Doug Cutting and Mike Cafarella (Cutting continued the work at Yahoo!) and is now an open source project of the Apache Software Foundation, released under the Apache License.

This post is intended to give you, at best, a very basic overview of the key components of Hadoop. It is by no means an in-depth analysis. I myself have only just opened the door into the world of big data, and I'm simply sharing what I have discovered so far.

Components of Hadoop

Hadoop consists of two key components: HDFS (Hadoop Distributed File System) and MapReduce. Both run on a Hadoop cluster, which is a set of machines networked together. Each machine in the cluster is a node; each node runs one or more Hadoop services and can be classified as either a master or a slave. HDFS and MapReduce each run as a set of services.

Let's take a closer look at HDFS and MapReduce.

HDFS

The job of HDFS is to store files by splitting them into blocks and spreading those blocks across multiple hosts. The default block size is 64MB, although Cloudera, whose distribution and management tools are a popular way of running Hadoop, recommends increasing it to 128MB. Blocks are replicated to hosts throughout the cluster; the default replication factor is 3. Replication increases both reliability and performance: the data can tolerate the loss of all but one replica, and multiple copies offer more opportunity for data locality.
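
If you want to see this in action, the fsck tool that ships with Hadoop will show you how a file has been split into blocks and where each replica lives. A quick sketch, using hypothetical HDFS paths:

# Show a file's blocks, their sizes, and which DataNodes hold each replica
hadoop fsck /user/me/bigfile.log -files -blocks -locations

# Change the replication factor of an existing file to 2 (-w waits until done)
hadoop fs -setrep -w 2 /user/me/bigfile.log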

MapReduce

MapReduce is a programming model, or framework, that is neither platform nor language specific; Hadoop's implementation is written in Java. Its sole purpose is record-oriented data processing, using key-value pairs to move data between phases. This facilitates task distribution across multiple hosts, and where possible each host processes the data it has stored locally. MapReduce has two phases: the Map phase, in which each input record is transformed into intermediate key-value pairs, and the Reduce phase, in which the values for each key are aggregated into the final result. Between these phases sits a shuffle and sort step, which groups the mappers' output by key and delivers it to the reducers. The full internals of MapReduce are beyond the scope of this post; I intend to write a more in-depth post after I have had time to study them.
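
To make the two phases concrete, here is how you might run the classic word count example that ships with Hadoop. The jar location varies between distributions, so treat the paths below as assumptions:

# Copy a local text file into HDFS
hadoop fs -mkdir /user/me/input
hadoop fs -put mybook.txt /user/me/input

# Run the word count job: the Map phase emits (word, 1) pairs, the shuffle
# and sort groups them by word, and the Reduce phase sums the counts
hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount /user/me/input /user/me/output

# Read the reducers' output
hadoop fs -cat /user/me/output/part-00000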

The HDFS Service Instances

HDFS requires three types of service instance to run: the NameNode (a master), a Secondary NameNode (also a master) and the DataNodes (slaves). Let's take a closer look at each of these components.

Master NameNode

The NameNode stores metadata: the locations of files in HDFS, the ownership and permissions of those files, the names of the individual blocks, and the locations of those blocks. All of this is stored in a file called "fsimage", which is read from disk when the NameNode daemon starts; this daemon runs on the master host. Block locations are not stored on disk; they are reported by the DataNodes during startup. Changes to the metadata are made in RAM and are also written to a log file on disk called "edits".

Secondary NameNode

Firstly, this is not a failover node. It performs memory-intensive administrative functions on behalf of the master NameNode, acting as a housekeeper: it periodically combines the prior file system snapshot (fsimage) with the edits log to produce a new snapshot, which is then transmitted back to the master. In a large installation it should run on a separate host and have as much RAM as the master.

DataNodes / Slave Hosts

These store the actual contents of the files, in the form of blocks from the original files. Each block is stored under a name of the form blk_xxxx. A DataNode does not need to know which underlying file a block is part of; that information is handled by the NameNode. For greater redundancy, blocks are replicated across multiple slave hosts (the default, as mentioned above, is 3 replicas), and each DataNode reports in regularly to the NameNode.
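
You can ask the NameNode what it currently knows about its DataNodes with a single command; a quick sketch, run from a cluster host:

# Report every live DataNode along with its capacity, usage and last contact time
hadoop dfsadmin -report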

MapReduce Service Instances

MapReduce has a single master service instance, the JobTracker, whose job is to parcel out work to the slave hosts. A TaskTracker runs on each slave host; its job is to execute the map and reduce tasks it has been assigned and to report progress back to the JobTracker.
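
The JobTracker can be queried from the command line too; a one-line sketch, assuming at least one job has been submitted:

# List the jobs the JobTracker currently knows about
hadoop job -list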

Conclusions

To summarise:

Hadoop is a framework for the storage and processing of large scale data sets.
Cloudera provides a popular distribution of Hadoop and tools for managing its services.
HDFS is the Java-based distributed file system, with three service instances: the master NameNode, the Secondary NameNode and the DataNodes.
MapReduce is the framework used for processing the data as key-value pairs, via the JobTracker and TaskTrackers.

As I said earlier, this is just a basic overview of Hadoop, and I will expand on these components in the near future. For now though, I hope you have gained a basic idea of what Hadoop is. Thanks for reading, and feel free to comment or ask me questions in the comment box below.

Tuesday 1 April 2014

An Introduction to Digital Analytics


What is it?

Digital Analytics is, to quote Avinash Kaushik:

The analysis of qualitative and quantitative data from your business and the competition to drive a continual improvement of the online experience that your customers and potential customers have which translates to your desired outcomes (online and offline) 

With the massive growth in connected devices (smartphones, tablets, etc.), the standard purchasing funnel of Awareness > Acquisition > Engagement > Conversion > Retention has become all but obsolete. Customers can now begin their buying journey at any stage of the funnel. What digital analytics provides is the means to understand customer trends and behaviours, and to use that data, through qualitative and quantitative analysis, to work out how best to drive customers towards an end goal, or "conversion"; on an ecommerce website, for example, this would be a purchase.

It's A Continuous Cycle

To build a solid analytics infrastructure, one needs to understand that good data is the foundation of smart decisions. Building this infrastructure takes time, effort, people, processes and technology. We need an analytics team made up of people who understand the business objectives, understand what analytics is, and have the technical competency to implement an analytics tool.

The following processes have been defined to help establish this infrastructure: define a measurement plan, document the technical infrastructure, create an implementation plan, implement the plan, and maintain and refine. Let's take a look at each process more closely.

Define A Measurement Plan

A five-stage model devised by Google Analytics guru Avinash Kaushik gives digital analysts a framework for creating a business-specific measurement plan. Let's take a closer look at the five stages:


1. Document business objectives - Why do we exist? What is our purpose?

2. Identify strategies and tactics - With the business objectives clearly defined, what strategies and tactics can you employ to achieve them?

3. Key performance indicators (KPIs) - How is my business performing against the strategies and tactics identified at stage 2? KPIs are the numbers, obtained by measuring those strategies and tactics, that you as the business owner need to look at day to day to understand how your business is performing.

4. Segments - With your KPIs defined, you need to decide which segments of data are important to measure. For example, a travel agency website might segment its KPIs by sales (number of sales daily, weekly, monthly or yearly) or by customer (new vs returning), which can help identify customers who may be eligible for special offers. The segments you choose depend entirely on the nature of your business and the strategies and tactics you are using.

5. Targets - The final stage of the measurement plan: define a target for each of your KPIs. This helps anyone who looks at the data to understand whether the business is succeeding or falling short.


Document Technical Infrastructure

With the measurement plan clearly defined, the next stage in the cycle is to understand your current technical environment by documenting your technical infrastructure. Some questions that might be asked are:

What are our server technologies - can they cope with the demands of our measurement plan?
Are we leveraging mobile technology?
Are we using responsive web design?
Can we track everything we need to track with the technology we are using? 

Create An Implementation Plan

At this stage, based on the business needs and the technical infrastructure of your business, you can begin to create an implementation plan specific to the analytics tool you have chosen. For example, with Google Analytics you would need to define the custom code snippets and the specific product features required to track the data defined in the measurement plan.

Implement The Plan

Here, the web development team and the mobile team (if required) bring your implementation plan to fruition.

Maintain And Refine

Because the digital world is ever changing, your measurement plan needs to be continually maintained and refined to ensure that your data evolves with the business. Thus, the measurement planning process becomes a continuous cycle.

In the next post, we shall look closely at one of the industry leaders in digital analytics tools: Google Analytics.


Acknowledgements: Google Analytics Academy and Avinash Kaushik (http://www.kaushik.net/avinash)