Free Data Science eBooks - January 2018
Now that Christmas and the New Year are behind us, the nights are becoming a little shorter with each passing day. Nevertheless, there are still loads of cold winter nights left to endure (unless you're in the Southern Hemisphere, in which case - throw me a shrimp on the barbie!).
It's time to dust off your New Year resolutions from last year (remember those?) and get ready to learn some new data skills.
Here are three free eBooks to help you on that journey and make those long nights just that bit shorter.
I hope these books prove to be a valuable resource to you and that you will visit regularly (and share with your friends on social media too).
This month we highlight 3 books:
- Data-Intensive Text Processing with MapReduce
- Programming Pig
- Test-Driven Development With Python
They're all FREE, so help yourselves...
Data-Intensive Text Processing with MapReduce
by Jimmy Lin and Chris Dyer
Our world is being revolutionized by data-driven methods: access to large amounts of data has generated new insights and opened exciting new opportunities in commerce, science, and computing applications. Processing the enormous quantities of data necessary for these advances requires large clusters, making distributed computing paradigms more crucial than ever.
MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. The programming model provides an easy-to-understand abstraction for designing scalable algorithms, while the execution framework transparently handles many system-level details, ranging from scheduling to synchronization to fault tolerance.
This book focuses on MapReduce algorithm design, with an emphasis on text processing algorithms common in natural language processing, information retrieval, and machine learning. We introduce the notion of MapReduce design patterns, which represent general reusable solutions to commonly occurring problems across a variety of problem domains.
This book not only aims to help the reader "think in MapReduce", but also discusses the limitations of the programming model.
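To give a flavour of what "thinking in MapReduce" means, here is a minimal single-machine sketch of the classic word count. It is not taken from the book and does not use Hadoop; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are made up for illustration. On a real cluster the execution framework would handle the shuffle step and run many mappers and reducers in parallel.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Mapper: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group intermediate values by key.

    On a real cluster the framework does this between map and reduce.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reducer: sum all the counts emitted for one word."""
    return word, sum(counts)


def word_count(documents):
    """Run map, shuffle and reduce in sequence on a dict of {doc_id: text}."""
    intermediate = []
    for doc_id, text in documents.items():
        intermediate.extend(map_phase(doc_id, text))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, purely for illustration.
    docs = {"d1": "the quick brown fox", "d2": "the lazy dog and the fox"}
    print(word_count(docs))
```

The point of the split is that the mapper and reducer only ever see one document or one key at a time, which is what lets a framework spread the work across many machines.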
Programming Pig
by Alan Gates
This guide is an ideal learning tool and reference for Apache Pig, the open source engine for executing parallel data flows on Hadoop. With Pig, you can batch-process data without having to create a full-fledged application, making it easy for you to experiment with new datasets.
Programming Pig introduces new users to Pig, and provides experienced users with comprehensive coverage of key features such as the Pig Latin scripting language, the Grunt shell, and User Defined Functions (UDFs) for extending Pig. If you need to analyze terabytes of data, this book shows you how to do it efficiently with Pig.
- Delve into Pig’s data model, including scalar and complex data types
- Write Pig Latin scripts to sort, group, join, project, and filter your data
- Use Grunt to work with the Hadoop Distributed File System (HDFS)
- Build complex data processing pipelines with Pig’s macros and modularity features
- Embed Pig Latin in Python for iterative processing and other advanced tasks
- Create your own load and store functions to handle data formats and storage mechanisms
- Get performance tips for running scripts on Hadoop clusters in less time
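If you have never seen a data flow of the kind Pig Latin expresses, the rough idea can be sketched in plain Python. This is only an analogy, not Pig Latin and not how Pig runs on Hadoop; the records, the field layout and the `total_bytes_by_country` function are hypothetical. Each commented step corresponds loosely to one of the operations listed above (filter, project, group, sort).

```python
from collections import defaultdict
from operator import itemgetter

# Hypothetical records standing in for a loaded dataset:
# (user, country, bytes_sent)
records = [
    ("ann", "UK", 120),
    ("bob", "US", 300),
    ("cat", "UK", 80),
    ("dan", "US", 50),
    ("eve", "FR", 150),
]


def total_bytes_by_country(rows, min_bytes=60):
    """Filter, project, group and sort the rows, one step at a time."""
    # FILTER: drop rows below the threshold.
    kept = [row for row in rows if row[2] >= min_bytes]
    # PROJECT: keep only the fields the later steps need.
    projected = [(country, sent) for _, country, sent in kept]
    # GROUP and aggregate: sum bytes per country.
    totals = defaultdict(int)
    for country, sent in projected:
        totals[country] += sent
    # SORT: largest total first.
    return sorted(totals.items(), key=itemgetter(1), reverse=True)


if __name__ == "__main__":
    print(total_bytes_by_country(records))
```

In Pig each of these steps would be a single statement in a script, and the engine would work out how to run them in parallel across the cluster.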
Test-Driven Development With Python
by Harry Percival
By taking you through the development of a real web application from beginning to end, the second edition of this hands-on guide demonstrates the practical advantages of test-driven development (TDD) with Python.
You'll learn how to write and run tests before building each part of your app, and then develop the minimum amount of code required to pass those tests. The result? Clean code that works. In the process, you'll learn the basics of Django, Selenium, Git, jQuery, and Mock, along with current web development techniques.
If you're ready to take your Python skills to the next level, this book - updated for Python 3.6 - clearly demonstrates how TDD encourages simple designs and inspires confidence.
- Dive into the TDD workflow, including the unit test/code cycle and refactoring
- Use unit tests for classes and functions, and functional tests for user interactions within the browser
- Learn when and how to use mock objects, and the pros and cons of isolated vs. integrated tests
- Test and automate your deployments with a staging server
- Apply tests to the third-party plugins you integrate into your site
- Run tests automatically by using a Continuous Integration environment
- Use TDD to build a REST API with a front-end Ajax interface
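The test-first cycle described above can be sketched with nothing more than Python's standard `unittest` module. This is not an example from the book (which builds a Django web app); `slugify` is a made-up function chosen for illustration. In a TDD workflow you would write the two tests first, watch them fail, and only then write the smallest `slugify` that makes them pass.

```python
import unittest


def slugify(title):
    """Smallest implementation that satisfies the tests below."""
    return "-".join(title.lower().split())


class SlugifyTest(unittest.TestCase):
    """Tests that would be written before slugify() exists."""

    def test_lowercases_and_joins_words_with_hyphens(self):
        self.assertEqual(slugify("Programming Pig"), "programming-pig")

    def test_collapses_repeated_whitespace(self):
        self.assertEqual(slugify("  Clean   code  "), "clean-code")


def run_tests():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SlugifyTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_tests()
```

With the tests in place, you can refactor `slugify` freely and rerun them to check that nothing has broken.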
If you're interested in learning more about the content in this blog post, we've sought out the best blogs, books, video courses and other stuff from around the internet for you. Some may be free while others may not, and to help you decide we use the following ratings:
- FREE content
- costs less than 10 £/$/Euro
- costs less than 50 £/$/Euro
- costs less than 100 £/$/Euro
- costs more than 100 £/$/Euro
Disclosure: some of these resources may be affiliate links, and we may earn an affiliate commission for purchases you make when using these links.
You can find further details in our T&Cs.
Practical Data Cleaning - 19 Essential Tips to Scrub Your Dirty Data
It's always difficult knowing where to start, but especially so when it comes to Data Science. No need to fret, though - we've selected our top 21 books that all aspiring data scientists should read.
These will get you going in no time...
Correlation and Causation - The Trouble With Story Telling
How many times have you heard that ‘correlation does not imply causation’? Lots, but I bet you didn't know that there are five reasons why you should not trust your intuition. This book gives you the tools to discover the five traps that even experienced investigators fall into.
Videos & Video Courses
4-hour Udemy Video Course delivered with animated videos. Perfect for beginners, and it will help get you started with basic statistical concepts.
7-hour Udemy Video Course. Great for those needing a more business-oriented introduction to stats. Better still, the course even comes with homework. Yay!
9-hour Udemy Video Course. This is one of the top stats courses at Udemy and is a must-see for those who need to learn stats in R.
CorrelViz - visualise all the correlations in your data in minutes
CorrelViz is completely automated and gives you the Story of Your Data in minutes, with one click - saving you months of manual analysis and shed-loads of cash!
Analyse all your data, discover all the correlations you seek - and some you never even dreamed of...