Sampling
If data is too big to be analyzed in full, its size can be reduced by sampling. Naturally, the question arises whether sampling significantly decreases a model's performance. More data is, of course, better than less. But according to Hadley Wickham's useR! talk, sample-based model building is acceptable, at least once the data crosses the one-billion-record threshold.
If sampling can be avoided, it is advisable to use another Big Data strategy. But if sampling is necessary for whatever reason, it can still lead to satisfactory models, especially if the sample is representative of the full data set.
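A minimal base-R sketch of the idea: draw a random subset of rows and fit the model on the subset. The data frame here is simulated stand-in data, not a real workload.

```r
# Hypothetical large data set, simulated here with one million rows
big_df <- data.frame(x = rnorm(1e6), y = rnorm(1e6))

# Draw a random sample of 10,000 rows without replacement
set.seed(42)
idx <- sample(nrow(big_df), size = 1e4)
small_df <- big_df[idx, ]

# Fit the model on the sample instead of the full data
fit <- lm(y ~ x, data = small_df)
nrow(small_df)  # 10000
```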
Bigger hardware
R keeps all objects in memory. This can become a problem as data gets large. One of the easiest ways to deal with Big Data in R is therefore simply to increase the machine's memory. Today, R can address 8 TB of RAM when running on a 64-bit machine. In many situations that is a sufficient improvement over the roughly 2 GB of addressable RAM on 32-bit machines.
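Because everything lives in RAM, it is worth checking an object's footprint before deciding whether bigger hardware is needed; base R's `object.size` does this:

```r
# One million doubles occupy about 8 MB (8 bytes each, plus a small header)
x <- rnorm(1e6)
print(object.size(x), units = "MB")
```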
Store objects on hard disc and analyze them chunkwise
As an alternative, there are packages available that avoid keeping all data in memory. Instead, objects are stored on hard disc and analyzed chunkwise. As a side effect, the chunking naturally lends itself to parallelization, provided the algorithm allows the chunks to be analyzed in parallel. A downside of this strategy is that only those algorithms (and R functions in general) can be used that are explicitly designed to deal with hard-disc-specific data types.
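The chunkwise principle can be illustrated in base R alone: compute a statistic over a file by streaming it in fixed-size blocks, so only one chunk is ever in memory. A small temporary file stands in for a genuinely big one here.

```r
# Write a small single-column CSV to stand in for a file too big for RAM
path <- tempfile(fileext = ".csv")
write.csv(data.frame(value = 1:1000), path, row.names = FALSE)

con <- file(path, open = "r")
invisible(readLines(con, n = 1))   # skip the header line
total <- 0
n <- 0
repeat {
  lines <- readLines(con, n = 100)  # read 100 rows per chunk
  if (length(lines) == 0) break
  chunk <- as.numeric(lines)
  total <- total + sum(chunk)       # accumulate running statistics
  n <- n + length(chunk)
}
close(con)
total / n   # 500.5, identical to mean(1:1000)
```

Packages like ff hide exactly this kind of bookkeeping behind disk-backed vector types.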
“ff” and “ffbase” are probably the most famous CRAN packages following this principle. Revolution R Enterprise, a commercial product, follows this strategy as well with its popular “scaleR” package. Compared to ff and ffbase, Revolution's scaleR offers a wider and faster-growing range of analytic functions. For instance, the Random Forest algorithm has recently been added to the scaleR function set, while it is not yet available in ffbase.
Integration of higher-performing programming languages like C++ or Java
The integration of high-performance programming languages is another alternative. Small parts of the program are moved from R to another language to avoid bottlenecks and performance-expensive procedures. The aim is to balance R's elegant way of dealing with data against the higher performance of other languages.
The outsourcing of code chunks from R to another language can easily be hidden inside functions. In that case, proficiency in the other language is required of the developers of these functions, but not of their users.
rJava, a bridge between R and Java, is an example of this kind. Many R packages take advantage of it, mostly invisibly to the user. Rcpp, the integration of C++ and R, has gained attention recently since Dirk Eddelbuettel published his book “Seamless R and C++ Integration with Rcpp” in the popular Springer series “Use R!”. In addition, Hadley Wickham has added a chapter on Rcpp to his book “Advanced R development”, to be published in early 2014. It is relatively easy to outsource code from R to C++ with Rcpp; a basic understanding of C++ is sufficient to make use of it.
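A sketch of how little ceremony Rcpp requires (this assumes the Rcpp package and a working C++ compiler are installed): `cppFunction` compiles a C++ snippet and exposes it as an ordinary R function.

```r
library(Rcpp)

# Move a tight summation loop from R to C++ with a single call
cppFunction('
  double sumC(NumericVector x) {
    double total = 0;
    int n = x.size();
    for (int i = 0; i < n; ++i) {
      total += x[i];
    }
    return total;
  }
')

sumC(c(1, 2, 3))  # 6
```

For a one-pass loop like this, the compiled version avoids R's per-iteration interpretation overhead, which is exactly the kind of bottleneck the strategy targets.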
Alternative interpreters
A relatively new direction for dealing with Big Data in R is to use alternative interpreters. The first to become popular with a wider audience was pqR (pretty quick R). Duncan Murdoch from the R Core team has announced that pqR's suggested improvements are to be integrated into the core of R in one of the next versions.
Another very ambitious open-source project is Renjin. Renjin reimplements the R interpreter in Java, so it can run on the Java Virtual Machine (JVM). This may sound like a Sisyphean task, but it is progressing astonishingly fast. A major milestone in the development of Renjin is scheduled for the end of 2013.
Tibco created a C++-based interpreter called TERR. Besides the implementation language, TERR differs from Renjin in how object references are modeled. TERR is available free of charge for scientific and testing purposes; enterprises must purchase a license to use TERR in production.
Another alternative R interpreter is offered by Oracle. Oracle R uses Intel's math libraries and thereby achieves higher performance without changing R's core. Besides the interpreter, which is free to use, Oracle offers Oracle R Enterprise, a component of Oracle's “Advanced Analytics” database option. It allows any R code to be run on the database server and provides a rich set of functions optimized for high-performance in-database computation. These optimized functions cover, besides data management operations and traditional statistical tasks, a wide range of data-mining algorithms such as SVMs, neural networks, and decision trees.
Conclusion
A couple of years ago, R had the reputation of not being able to handle Big Data at all, and it probably still has among users who stick to other statistical software. Today, however, a number of quite different Big Data approaches are available. Which one fits best depends on the specifics of the given problem. There is not one solution for all problems, but there is some solution for any problem.